Chapter 5: Graphs and Maps

“The ideal situation occurs when the things that we regard as beautiful are also regarded by other people as useful.”—Donald Knuth

require(lubridate)
require(maps)
require(classInt)
require(maptools)
require(rgdal)
require(mapdata)
require(ggplot2)

Graphs and maps help you reason about your data. They also help you communicate your results. A good graph gives you the most information in the shortest time, with the least ink in the smallest space (Tufte 1997).

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/SOI.txt"
SOI = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/NAO.txt"
NAO = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/SST.txt"
SST = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/ATL.txt"
A = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/H.txt"
US = read.table(con, header = TRUE)

Graphs

With the standard (base) graphics environment you can produce a variety of plots with fine details. Most of the figures in my book are created using the standard graphics environment. The grid graphics environment is more flexible. It allows you to design complex layouts with nested graphs where scaling is maintained when you resize the figure.

The lattice and ggplot2 packages use grid graphics to create specialized graphing functions and methods. The spplot() function is an example of a plot method built with grid graphics that we use to create maps of our spatial data. The ggplot2 package is an implementation of the grammar of graphics combining advantages from the standard and lattice graphic environments.

Box plot

A box plot is a graph of the five-number summary of your data: the minimum, first quartile, median, third quartile, and maximum. The summary() method applied to a data vector computes these five statistics along with the mean. You create a box plot with the boxplot() function. For example, to create a box plot of your October SOI data, type

boxplot(SOI$Oct, ylab = "October SOI (s.d.)")
ggplot(SOI, aes(x = factor(0), y = Oct)) + geom_boxplot() + ylab("October SOI (s.d.)") + 
    xlab("")

You can annotate the plot to label the five numbers.

boxplot(SOI$Aug, ylab = "August SOI [s.d.]")
f = fivenum(SOI$Aug)
text(rep(1.25, 5), f, labels = c("minimum", "lower", "median", "upper", "maximum"), 
    adj = c(0, 0), cex = 0.75)

[Figure: box plot of August SOI annotated with the five-number summary]

The line inside the box is the median value. The bottom of the box (lower hinge) is the first quartile value and the top of the box (upper hinge) is the third quartile. The vertical line (whisker) from the top of the box extends to the maximum value and the vertical line from the bottom of the box extends to the minimum value.

Hinge values equal the quartiles exactly when there is an odd number of observations. Otherwise, each hinge is the middle value of the lower (or upper) half of the observations when that half contains an odd number of observations, and the average of the two middle values when it contains an even number.

The fivenum() function gives the five numbers used by boxplot(). The height of the box is essentially the interquartile range (IQR()) and the range is the distance from the minimum to the maximum.
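For example, to list the five numbers and the IQR for the August SOI values (the output follows from the sorted August values shown below), type

f = fivenum(SOI$Aug)
f
## [1] -2.74 -0.97 -0.16  0.66  3.48
IQR(SOI$Aug)
## [1] 1.63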

By default the whiskers are drawn as dashed lines extending from the box toward the minimum and maximum data values. Convention is to make the whiskers no longer than 1.5 times the height of the box (the IQR). Data values beyond this range, the outliers, are marked separately with points. Here with the default options there is one data value more than 1.5 times the IQR above the upper quartile, so the upper whisker extends only to the largest data value within that range.

For example, to compute the upper fence (the upper quartile plus 1.5 times the IQR), type

Q1 = fivenum(SOI$Aug)[2]
Q3 = fivenum(SOI$Aug)[4]
Q3 + (Q3 - Q1) * 1.5
## [1] 3.105

One observation is larger than 3.105. In this case, the upper whisker ends at the next highest observation value less than 3.105. Observations above and below the whiskers are considered outliers. You can find the value of the single outlier of the August SOI by typing

sort(SOI$Aug)
##   [1] -2.740 -2.660 -2.550 -2.440 -2.310 -2.200 -2.140 -2.110 -2.030 -1.890
##  [11] -1.840 -1.750 -1.730 -1.670 -1.620 -1.460 -1.420 -1.410 -1.410 -1.400
##  [21] -1.390 -1.370 -1.240 -1.200 -1.180 -1.130 -1.120 -1.090 -1.090 -1.080
##  [31] -1.070 -1.060 -1.050 -1.020 -1.010 -1.000 -0.970 -0.960 -0.940 -0.920
##  [41] -0.920 -0.900 -0.900 -0.900 -0.880 -0.850 -0.820 -0.750 -0.750 -0.710
##  [51] -0.710 -0.690 -0.680 -0.640 -0.640 -0.640 -0.630 -0.590 -0.590 -0.550
##  [61] -0.500 -0.430 -0.420 -0.390 -0.380 -0.370 -0.350 -0.280 -0.230 -0.210
##  [71] -0.200 -0.170 -0.160 -0.150 -0.110 -0.110 -0.110 -0.110 -0.060 -0.030
##  [81] -0.010  0.000  0.010  0.050  0.060  0.100  0.120  0.130  0.190  0.200
##  [91]  0.200  0.260  0.300  0.310  0.330  0.330  0.340  0.360  0.380  0.400
## [101]  0.440  0.440  0.480  0.480  0.530  0.560  0.600  0.640  0.660  0.660
## [111]  0.690  0.720  0.720  0.750  0.750  0.800  0.800  0.830  0.880  0.890
## [121]  0.890  0.900  0.940  0.980  0.990  1.050  1.060  1.080  1.120  1.140
## [131]  1.180  1.240  1.260  1.300  1.320  1.330  1.360  1.380  1.420  1.430
## [141]  1.510  1.771  2.030  2.200  3.480

Note the value 2.2 is the largest observation less than 3.105, so the upper whisker ends there and the maximum (3.48) is plotted as an outlier.

Your observations are said to be symmetric if the median is near the middle of the box and the length of the two whiskers are about equal. A symmetric set of observations will also have about the same number of high and low outliers.

To summarize, 25% of all your observations fall below the lower quartile (below the box), 50% fall below (and above) the median, and 25% fall above the upper quartile. The box contains the middle 50% of your data. The upper whisker extends from the upper quartile toward the maximum and the lower whisker extends from the lower quartile toward the minimum, except that neither extends more than 1.5 times the IQR beyond its quartile. Data values beyond that range are plotted as individual points (outliers). This outlier convention is turned off by setting the range argument to zero.
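For example, to draw the whiskers all the way to the extreme values with no points marked as outliers, type

boxplot(SOI$Aug, range = 0, ylab = "August SOI [s.d.]")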

The box plot is an efficient graphical summary of your data, but it can be designed better. For example, the box sides provide redundant information. By removing the box lines altogether, the same information is available with less ink.

The figure below is a series of box plots, one for each month of the SOI. The dot represents the median; the ends of the lines nearest the dot are the lower and upper quartiles, respectively; the ends of the lines toward the bottom and top of the graph are the minimum and maximum values, respectively.

plot(c(1, 12), c(-5, 5), type = "n", xaxt = "n", bty = "n", xlab = "", ylab = "SOI [s.d.]")
axis(1, at = seq(1, 12, 1), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), cex = 0.5)
for (i in 1:12) {
    points(i, fivenum(SOI[, i + 1])[3], pch = 19)
    lines(c(i, i), c(fivenum(SOI[, i + 1])[1], fivenum(SOI[, i + 1])[2]))
    lines(c(i, i), c(fivenum(SOI[, i + 1])[4], fivenum(SOI[, i + 1])[5]))
}

[Figure: Tufte-style box plots of monthly SOI]

Here the empty space between the inner ends of the two line segments, rather than a box, shows the interquartile range.

Histogram

A histogram is a graph of the distribution of your observations. It shows where the values tend to cluster and where they tend to be sparse. The histogram is similar to a bar plot. The histogram uses bars to indicate frequency (or proportion) in data intervals, whereas a bar plot uses bars to indicate the frequency of data by categories. The hist() function creates a histogram.

As an example, consider NOAA's annual values of accumulated cyclone energy (ACE) for the North Atlantic and July SOI values. Annual ACE is calculated by squaring the maximum wind speed for each six-hour tropical cyclone observation and summing over all cyclones in the season.

The values obtained from NOAA (http://www.aoml.noaa.gov/hrd/tcfaq/E11.html) are expressed in units of knots squared \( \times 10^4 \). You create the two histograms and plot them side-by-side. Details on plotting options are given in Murrell (2005). After your histogram is plotted the function rug() adds tick marks along the horizontal axis at the location of each observation.

hist(A$ACE)
rug(A$ACE)

[Figure: histogram of North Atlantic ACE]

hist(SOI$Jul)
rug(SOI$Jul)

[Figure: histogram of July SOI]

Plot titles are useful for presentations, but should not be used for publication. The default horizontal axis label is the name of the data vector. The default vertical axis is frequency and is labeled accordingly.

To create a histogram using ggplot, type

ggplot(SOI, aes(Jul)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

[Figure: ggplot histogram of July SOI]

The hist() function has lots of options. Default values for these options provide a good starting point, but you might want to make adjustments. Thus it is good to know how the histogram is assembled. First a contiguous collection of disjoint intervals, called bins (or classes), is chosen to cover the range of data values. The default for the number of bins is \( \lceil \log_2(n)+1 \rceil \), where \( n \) is the sample size and \( \lceil \, \rceil \) indicates the ceiling value (next largest integer). If you type

n = length(SOI$Jul)
ceiling(log(n, base = 2) + 1)
## [1] 9

you can see that adjustments are made to this number so that the cut points correspond to whole number data values. In the case of ACE the adjustment results in 7 bins and in the case of the SOI it results in 11 bins. Thus the computed number of bins is a suggestion that gets modified to make for nice breaks.

The bins are contiguous and disjoint so the intervals look like \( (a, b] \) or \( [a, b) \), where the interval \( (a, b] \) runs from \( a \) to \( b \), including \( b \) but not \( a \). Next, the number of data values in each interval is counted. Finally, a bar is drawn above each interval so that the bar height is the count (frequency). A useful argument to make your histogram more visually understandable is prob=TRUE, which sets the bar height to the density, where the sum of the densities times the bar interval widths equals one.
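For example, a density-scale version of the July SOI histogram is created by typing

hist(SOI$Jul, prob = TRUE, main = "", xlab = "July SOI [s.d.]")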

You conclude that ACE is positively skewed with a relatively few years having very large values. Clearly ACE does not follow a normal distribution. In contrast, the SOI appears quite symmetric with rather short tails as you would expect from a normal distribution.

Density plot

A histogram outlines the general shape of your data. Usually that is sufficient. You can adjust the number of bins (or bin width) to get more or less detail on the shape. An alternative is a density plot. A density plot captures the distribution shape by essentially smoothing the histogram. Instead of specifying the bin width you specify the amount (and type) of smoothing.

There are two steps to produce a density plot. First you need to use the density() function to obtain a set of kernel density estimates from your observations. Second you need to plot these estimates, typically by using the plot() method.
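In its simplest form, using the default bandwidth, this looks like

d = density(SOI$Jul)
plot(d, main = "", xlab = "July SOI [s.d.]")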

A kernel density is a function that provides an estimate of the average number of values at any location in the space defined by your data. This is illustrated below, where the October SOI values for the period 2005–2010 are indicated as a rug and the kernel density function is shown as the black curve. The height of the function, representing the local density, is the sum of the heights of the individual kernels shown in red.

x = tail(SOI$Oct)  # October SOI values, 2005-2010
xlims = c(-3, 3)
bandw = 0.5
plot(density(x, bw = bandw), xlim = xlims, type = "n", main = "", xlab = "October SOI [s.d.]")
rug(x, lwd = 3)
for (i in 1:length(x)) {  # draw each kernel, scaled by 1/n
    d = density(x[i], bw = bandw)
    lines(d$x, d$y/length(x), col = "red")
}
lines(density(x, bw = bandw), lwd = 2)  # the density estimate: sum of the kernels

[Figure: kernel density estimate of October SOI (2005–2010) with the individual kernels in red]

The kernel is a Gaussian (normal) distribution centered at each data value. The width of the kernel, called the bandwidth, controls the amount of smoothing. The bandwidth is the standard deviation of the kernel in the density() function. This means the inflection points on the kernel are separated by one bandwidth in units of the data values. Here with the SOI in units of standard deviation the bandwidth equals .5 s.d.
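When you omit the bw argument, density() picks a bandwidth using Silverman's rule of thumb. To see the value it picks, type, for example,

density(NAO$Jun)$bw  # bandwidth used by default
bw.nrd0(NAO$Jun)     # the rule-of-thumb computation behind it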

A larger bandwidth produces a smoother density plot for a fixed number of observations because the kernels have greater overlap. The next figure shows the density plot of June NAO values from the period 1851–2010 using bandwidths of .1, .2, .5, and 1.

par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1, mex = 0.7, las = 1, mgp = c(3, 
    0.4, 0), tcl = -0.3)
x = NAO$Jun
xlims = range(x)
bws = c(0.1, 0.2, 0.5, 1)
for (i in 1:4) {  # one panel per bandwidth, labeled a-d
    plot(density(x, bw = bws[i]), xlim = xlims, main = "", xlab = "June NAO [s.d.]", 
        lwd = 2)
    rug(x)
    mtext(letters[i], side = 3, line = 1, adj = 0, cex = 1.1)
}

[Figure: density plots of June NAO using bandwidths of (a) 0.1, (b) 0.2, (c) 0.5, and (d) 1]

par(mfrow = c(1, 1))

The smallest bandwidth produces a density plot that has spikes, capturing the fine-scale variability in the distribution of values. As the bandwidth increases the spikes disappear and the density gets smoother. The largest bandwidth produces a smooth, symmetric density centered on zero.

To create a density plot for the NAO values with a histogram overlay, type

par(las = 1, mgp = c(2.3, 0.4, 0), tcl = -0.3)
d = density(NAO$Jun, bw = 0.5)
plot(d, main = "", xlab = "June NAO [s.d.]", lwd = 2, col = "red")
hist(NAO$Jun, prob = TRUE, add = TRUE)
rug(NAO$Jun)

[Figure: density plot of June NAO with histogram overlay]

The density() function takes your vector of data values as input and allows you to specify a bandwidth using the bw argument. Here you are using the vector of June NAO values and a bandwidth of .5 s.d. The bandwidth units are the same as the units of your data, here s.d. for the NAO. The output is saved as a density object, here called d. The object is then plotted using the plot() method. You turn off the default plot title with main = "" and you specify a label for the values plotted below the horizontal axis. You specify the line width as 2 and the line color as red.

You then add the histogram over the density line using the hist() function. You use prob=TRUE to make the bar height proportional to the density. The argument add=TRUE is needed so that the histogram plots on the same device.

A reason for plotting the histogram or density is to see whether your data have a normal distribution. The Q-Q plot provides another way to make this assessment.

Q-Q plot

A Q-Q plot graphically compares distributions. It does this by plotting quantile (Q) values of one distribution against the corresponding quantile (Q) values of the other distribution. In the case of assessing whether or not your data are normally distributed, the sample quantiles are plotted on the vertical axis and quantiles from a standard normal distribution are plotted along the horizontal axis. In this case it is called a Q-Q normal plot.

That is, the \( k \)th smallest observation is plotted against the expected value of the \( k \)th smallest random value from a \( N(0, 1) \) sample of size \( n \). The pattern of points in the plot is then used to compare your data against a normal distribution. If your data are normally distributed, the points should align along a straight line (the \( y=x \) line when the data are standardized).
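This construction can be sketched directly: the theoretical quantiles are the normal quantiles of the plotting positions returned by ppoints(), which is essentially what qqnorm() computes.

sq = sort(SOI$Jul)               # sample quantiles
tq = qnorm(ppoints(length(sq)))  # theoretical N(0, 1) quantiles
plot(tq, sq)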

This is done using the qqnorm() function. To make side-by-side Q-Q normal plots for the ACE values and the July SOI values you type

c = 0.5144^2
par(mfrow = c(1, 2), pty = "s", las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
qqnorm(A$ACE * c, main = "", ylab = expression(paste("Sample quantiles (ACE [x", 
    10^4, m^2, s^-2, "])")), xlab = "Theoretical quantiles")
qqline(A$ACE * c, col = "red")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
qqnorm(SOI$Jul, main = "", ylab = "Sample quantiles (SOI [s.d.])", xlab = "Theoretical quantiles")
qqline(SOI$Jul, col = "red")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

[Figure: Q-Q normal plots of (a) ACE and (b) July SOI]

par(mfrow = c(1, 1))

The quantiles are non-decreasing. A reference line, drawn through the first and third quartile pairs, is added to the plot using the qqline() function. Additionally we adjust the vertical axis label and turn the default title off.

The plots show that July SOI values appear to have a normal distribution while the seasonal ACE does not. For observations that have a positive skew, like the ACE, the pattern of points on a Q-Q normal plot is concave upward. For observations that have a negative skew the pattern of points is concave downward. For values that have a symmetric distribution but with fatter tails than the normal (e.g., the \( t \)-distribution), the pattern of points resembles an inverse sine function.
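You can check these patterns with simulated values. A sample from an exponential distribution (positively skewed), for instance, bends concave upward.

z = rexp(100)
qqnorm(z)
qqline(z, col = "red")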

The Q-Q normal plot is useful in checking the residuals from a regression model. The assumption is that the residuals are independent and identically distributed from a normal distribution centered on zero. Previously you created a multiple linear regression model for August SST using March SST and year as the explanatory variables. To examine the assumption of normally distributed residuals with a Q-Q normal plot, type

model = lm(Aug ~ Year + Mar, data = SST)
qqnorm(model$residuals)
qqline(model$residuals, col = "red")

[Figure: Q-Q normal plot of the regression model residuals]

The points align along the reference line, indicating the residuals are consistent with a normal distribution.

Scatter plot

The plot() method creates a scatter plot. The values of one variable are plotted against the values of the other variable as points in a Cartesian plane. The values named in the first argument are plotted along the horizontal axis.

This pairing of two variables is useful in generating and testing hypotheses about a possible relationship. In the context of correlation which variable gets plotted on which axis is irrelevant. Either way, the scatter of points illustrates the amount of correlation. However, in the context of a statistical model, by convention the dependent variable (the variable you are interested in explaining) is plotted on the vertical axis and the explanatory variable is plotted on the horizontal axis.

For example, if your interest is whether pre-hurricane season ocean warmth (e.g., June SST) is related to ACE, your model is

ace = A$ACE * 0.5144^2
sst = SST$Jun
model = lm(ace ~ sst)

and you plot ACE on the vertical axis. Since your slope and intercept coefficients are saved as part of the object model, you first create a scatter plot then use the abline() function to add the linear regression line. The function extracts the intercept and slope coefficient values from the model object and draws the straight line in slope-intercept form.

Note the model formula syntax (ace ~ sst) as the first argument in the plot() method.
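In its simplest form the scatter plot and regression line require only two function calls.

plot(ace ~ sst, ylab = "ACE", xlab = "SST [deg C]")
abline(model, col = "red")

The fuller version below also shades a 95% confidence band around the line.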

ci.lwr = predict(model, data.frame(sst = sort(sst)), level = 0.95, interval = "confidence")[, 
    2]
ci.upr = predict(model, data.frame(sst = sort(sst)), level = 0.95, interval = "confidence")[, 
    3]
par(mfrow = c(1, 1), pty = "s", las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
plot(ace ~ sst, type = "n", ylab = "ACE [times 10^4 m^2 s^-2]", xlab = "SST [deg C]")
grid()
xx = c(sort(sst), rev(sort(sst)))
yy = c(ci.upr, rev(ci.lwr))
polygon(xx, yy, col = "gray", border = "gray")
points(sst, ace, pch = 19, cex = 0.8)
abline(model, col = "red", lwd = 2)

[Figure: scatter plot of ACE versus June SST with regression line and 95% confidence band]

The relationship between ACE and SST is summarized by the linear regression model shown by the straight line. The slope of the line indicates that for every 1\( ^\circ \)C increase in SST the average value of ACE increases by 27 \( \times 10^4 \) m\( ^2 \)/s\( ^2 \) (type coef(model)[2]).

Since the regression line is based on a sample of data you should display it inside a band of uncertainty. Recall there are two types of uncertainty bands: a confidence band (narrow) and a prediction band (wide). The confidence band reflects the uncertainty about the line itself, which, like the standard error of the mean, indicates the precision by which you know the mean. In regression, the mean is not constant but rather a function of the explanatory variable.

The width of the band is inversely related to the sample size. In a large sample of data, the confidence band will be narrow, reflecting a well-determined line. Note that it is impossible to draw a horizontal line that fits completely within the band. This indicates a significant relationship between ACE and SST.

The band is narrowest in the middle which is understood by the fact that the predicted value at the mean SST will be the mean of ACE, whatever the slope, and thus the standard error of the predicted value at this point is the standard error of the mean of ACE. At other values of SST we need to add the variability associated with the estimated slope. This variability is larger for values of SST farther from the mean, which is why the band looks like a bow tie.

The prediction band adds another layer of uncertainty; the uncertainty about future values of ACE. The prediction band captures the majority of the observed points in the scatter plot. Unlike the confidence band, the width of the prediction band depends strongly on the assumption of normally distributed errors with a constant variance across the values of the explanatory variable.
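As a sketch of how you might draw the prediction band, use predict() with interval = "prediction" (R will warn that predictions on the data used for fitting refer to future responses).

pi = predict(model, data.frame(sst = sort(sst)), level = 0.95, interval = "prediction")
plot(ace ~ sst, pch = 19, cex = 0.8)
lines(sort(sst), pi[, 2], lty = 2)  # lower prediction limit
lines(sort(sst), pi[, 3], lty = 2)  # upper prediction limit
abline(model, col = "red", lwd = 2)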

Things are simpler with ggplot().

ace.df = data.frame(ace = ace, sst = sst)
ggplot(ace.df, aes(x = sst, y = ace)) + geom_point() + geom_smooth(method = "lm")
## Warning: Removed 5 rows containing missing values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

[Figure: ggplot scatter plot of ACE versus SST with linear smoother]

Conditional scatter plot

Scatter plots conditional on the values of a third variable can be informative. This is done with the coplot() function. The syntax is the same as above except you add the name of the conditioning variable after a vertical bar.

For example, as you saw above, there is a positive relationship between ACE and SST. The conditioning plot answers the question: does the relationship change depending on values of the third variable? Here you use August SOI values as the conditioning variable and type

soi = SOI$Aug
coplot(ace ~ sst | soi, panel = panel.smooth)

[Figure: conditioning plot of ACE versus SST given August SOI]

## 
##  Missing rows :  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

The syntax is read 'conditioning plot of ACE versus SST given values of SOI.' The function divides the range of the conditioning variable (SOI) into six intervals with each interval having approximately the same number of years. The range of SOI values in each interval overlaps by 50%. The conditioning intervals are plotted in the top panel as horizontal bars (shingles).

The scatter plots of ACE and SST are arranged in a matrix of panels below the shingles. The panels are arranged from lower left to upper right. The lower left panel corresponds to the lowest range of SOI values (less than about -1 s.d.) and the upper right panel corresponds to the highest range of SOI values (greater than about +.5 s.d.). Half of the data points in a panel are shared with the panel to the left and half of the data points are shared with the panel to the right. This is indicated by the amount of shingle overlap.
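The number of conditioning intervals and the fraction of shared points are controlled with the number and overlap arguments. For example, to use four intervals with 25% overlap, type

coplot(ace ~ sst | soi, number = 4, overlap = 0.25, panel = panel.smooth)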

Results show a positive, nearly linear, relationship between ACE and SST for all ranges of SOI values. However over the SOI range between -1.5 and 0 the relationship is nonlinear. ACE is least sensitive to SST when SOI is the most negative (El Nino years) as indicated by the nearly flat line in the lower-left panel. The argument panel adds a local linear curve (red line) through the set of points in each plot.

Time series

Hurricane data often take the form of a time series. A time series is a sequence of data values measured at successive times and spaced at uniform intervals. You can treat a time series as a vector and use structured data functions to generate time series.

However, additional functions are available for data that are converted to time-series objects. Time-series objects are created using the ts() function. You do this with the monthly NAO data frame as follows. First create a matrix of the monthly values, skipping the year column in the data frame. Second, take the transpose of this matrix (switch the rows with the columns) using the t() function and save the matrix as a vector. Finally, create a time-series object, specifying the frequency of values and the start month. Here the first value is from January 1851.

nao.m = as.matrix(NAO[, 2:13])
nao.v = as.vector(t(nao.m))
nao.ts = ts(nao.v, frequency = 12, start = c(1851, 1))

Also create a time-series object for the cumulative sum of the monthly NAO values. This is done with the cumsum() function applied to your data vector.

nao.cts = ts(cumsum(nao.v), frequency = 12, start = c(1851, 1))

This results in objects of class ts, which is used for time series having numeric time information. Additional classes for working with time series data that can handle dates and other types of time information are available in other packages. For example, the fts package implements regular and irregular time series based on POSIXct time stamps and the zoo package provides functions for most time series classes.
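As a minimal sketch, assuming the zoo package is installed, the monthly NAO values can be indexed by actual dates as follows.

require(zoo)
dates = seq(as.Date("1851-01-01"), by = "month", length.out = length(nao.v))
nao.z = zoo(nao.v, order.by = dates)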

Time-series graph

The objects of class ts are easy to plot as a time series. For instance, you plot the cumulative sum of the NAO values using the plot() method. The method recognizes the object as a time series and plots it accordingly eliminating the need to specify a separate time variable.

par(las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
plot(nao.cts, xlab = "Year", ylab = "Cumulative NAO [s.d.]")
abline(h = 0, col = "red")
grid()
lines(nao.cts, lwd = 2)

[Figure: time series of the cumulative sum of monthly NAO]

The cumulative sum indicates a pattern typical of a random walk. That is, over the long term there is a tendency for more positive-value months, leading to a 'wandering' of the cumulative sum away from the zero line. This tendency begins to reverse in the late 20th century.

Autocorrelation

Autocorrelation is correlation between values from a single variable. For time data it refers to a single series correlated with itself as a function of temporal lag. For spatial data it refers to a single variable correlated with itself as a function of spatial lag, which can be a vector of distance and orientation. In both cases the term 'autocorrelation function' is used, but with spatial data the term is often qualified with 'spatial.'

As an example, save 30 random values from a standard normal distribution in a vector where the elements are considered ordered in time. First create a time-series object. Then use the lag.plot() function to create a scatter plot of the time series against a lagged copy where the lagged copy starts one time interval earlier.

t0 = ts(rnorm(30))
lag.plot(t0, lags = 3)

[Figure: lag plots of the random series at lags one through three]

With N values, the plot for lag one contains N-1 points. The points are plotted with text numbers indicating the temporal order, so the first point, labeled '1', has coordinates (t0[1], t0[2]). The correlation at lag one can be inferred from the scatter of points. The plot can be repeated for any number of lags; with higher lags the number of points decreases, and a separate panel is drawn for each lag.
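The lag-one correlation suggested by the scatter can be computed directly by correlating the series with a shifted copy of itself.

x = as.numeric(t0)
n = length(x)
cor(x[-n], x[-1])  # correlation of the pairs (t0[i], t0[i + 1])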

You use the autocorrelation function acf() to quantify the correlation at various temporal lags. The function accepts univariate and multivariate numeric time-series objects and produces a plot of the autocorrelation values as a function of lag.

To create a plot of the autocorrelation and partial autocorrelation functions for the NAO time series object created in the previous section, type

par(mfrow = c(1, 2), pty = "s", las = 1)
acf(nao.ts, ci.col = "red", xlab = "Lag [years]", main = "", ylab = "Autocorrelation")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
pacf(nao.ts, ci.col = "red", xlab = "Lag [years]", main = "", ylab = "Partial autocorrelation")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

[Figure: (a) autocorrelation and (b) partial autocorrelation functions of monthly NAO]

The lag values plotted on the horizontal axis are in units of time rather than numbers of observations. Dashed lines are the 95% confidence limits. Here the time-series object is created using monthly frequency, so the lags are given in fractions of 12, with 1.0 corresponding to a year. The maximum lag is calculated as 10 × log10(N), where N is the number of observations. This is changed with the lag.max argument.

The lag-zero autocorrelation is fixed at 1 by convention. The nonzero-lag autocorrelations are all less than .1 in absolute value, indicative of an uncorrelated process. By default the plot includes 95% confidence limits (ci) computed as +/-1.96/sqrt(N).
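For the monthly NAO series this works out to

1.96/sqrt(length(nao.v))  # about 0.045 for 160 years of monthly values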

The partial autocorrelation function pacf() computes the autocorrelation at lag k after the linear dependencies between lags 1 to k-1 are removed. The partial autocorrelation determines the extent of the lag in an autoregressive model. Here the partial autocorrelation vacillates between positive and negative values indicative of a moving-average process.

A moving-average process is one in which the expectation of the current value of the series is linearly related to previous white-noise errors.

If your regression model uses time-series data it is important to examine the autocorrelation in the model residuals. If residuals from your regression model have significant autocorrelation then the assumption of independence is violated. This violation does not bias the coefficient estimates, but with positive autocorrelation the standard errors on the coefficients tend to be too small giving you unwarranted confidence in your inferences.
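For example, to examine the residuals from the August SST regression model fit earlier, type

sstmod = lm(Aug ~ Year + Mar, data = SST)
acf(residuals(sstmod), ci.col = "red")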

Dates and times

You have various options for working with date and time data in R. The as.Date() function gives you flexibility in handling dates through the format argument. The default format is a four-digit year, a month, then a day, separated by dashes or slashes. For example, the character string "1992-8-24" will be accepted as a date by typing

Andrew = as.Date("1992-8-24")

Although the print method displays it as a character string, the object is a Date class stored as the number of days since January 1, 1970, with negative numbers for earlier dates.
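You can see this internal representation by typing

as.numeric(Andrew)
## [1] 8271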

If your input dates are not in the standard year, month, day order, a format string can be composed using the elements shown in the table below. For instance, if your date is specified as August 29, 2005 then you type

Katrina = as.Date("August 29, 2005", format = "%B %d, %Y")

Format codes for dates.

Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated, e.g., Jan)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)
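For example, a day-first date with a two-digit year is converted by typing

as.Date("24/08/92", format = "%d/%m/%y")
## [1] "1992-08-24"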

Without having to count the leap years between hurricanes Andrew and Katrina, you can find the number of days between them by typing

difftime(Katrina, Andrew, units = "days")
## Time difference of 4753 days

Or you can obtain the number of days from today since Andrew by typing

difftime(Sys.Date(), Andrew, units = "days")
## Time difference of 7493 days

The function Sys.Date() gives the current day in year-month-day format as a Date object.

The portable operating system interface (POSIX) has formats for dates and times, with functionality for converting between time zones (Spector 2008). The POSIX date/time classes store times to the nearest second. There are two such classes differing only in the way the values are stored internally. The POSIXct class stores date/time values as the number of seconds since January 1, 1970, while the POSIXlt class stores them as a list. The list contains elements for second, minute, hour, day, month, and year among others.
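You can inspect the POSIXlt list elements directly; note that the year is stored as years since 1900 and the months are numbered from zero.

pl = as.POSIXlt("2005-08-29 09:05", tz = "GMT")
pl$hour  # 9
pl$year  # 105 (years since 1900)
pl$mon   # 7 (months run 0 through 11)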

The default input format for POSIX dates consists of the year, month, and day, separated by slashes or dashes, with time information following after white space.

The time information is in the format hour:minutes:seconds or simply hour:minutes. For example, according to the U.S. National Hurricane Center, Hurricane Andrew hit Homestead Air Force Base at 9:05 UTC on August 24, 1992. You add time information to your Andrew date object and convert it to a POSIXct object.

Andrew = as.POSIXct(paste(Andrew, "09:05"), tz = "GMT")

You then retrieve your local time from your operating system as a character string and use the date-time conversion strptime() function to convert the string to a POSIXlt class.

mytime = strptime(Sys.time(), format = "%Y-%m-%d %H:%M:%S", tz = "EST5EDT")

Our time zone is U.S. Eastern time, so we use tz = "EST5EDT". You then find the number of hours since Andrew's landfall by typing,

difftime(mytime, Andrew, units = "hours")
## Time difference of 179841 hours

Additional functionality for working with times is available in the chron and lubridate packages. In particular, lubridate (great package name) makes it easy to work with dates and times by providing functions to identify and parse date-time data, extract and modify components (years, months, days, hours, minutes, and seconds), perform accurate math on date-times, and handle time zones and Daylight Saving Time (Grolemund and Wickham 2011).

For example, to return the day of the week from your object Andrew you use the wday() function in the package by typing,

require(lubridate)
wday(Andrew, label = TRUE, abbr = FALSE)
## [1] Monday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Other useful functions in the package, applied to the Andrew time object, include the year, whether it was a leap year, the week of the year, and the corresponding local time. Finally, what is the current time in Chicago?

year(Andrew)
## [1] 1992
leap_year(Andrew)
## [1] TRUE
week(Andrew)
## [1] 34
with_tz(Andrew, tz = "America/New_York")
## [1] "1992-08-24 05:05:00 EDT"
now(tz = "America/Chicago")
## [1] "2013-02-28 12:02:42 CST"

Maps

Maps are among the most compelling graphics as the space they map is the space in which hurricanes occur. Various packages are available for creating maps.

Boundaries

Sometimes all that is needed is a reference map to show your study location. This can be created using state and country boundaries. For example, the maps package is used to draw country and state borders. To draw a map of the United States with state boundaries, type

require(maps)
map("state")

The function map() creates the country outline and adds the state boundaries. The package contains outlines for countries around the world (e.g., type map()).

map("state", interior = FALSE)
map("state", boundary = FALSE, col = "gray", add = TRUE)

[Figure: map of U.S. state boundaries drawn with the maps package]

The coordinate system is latitude and longitude, so you can overlay other spatial data. As an example, first input the track of Hurricane Ivan (2004) as it approached the U.S. Gulf coast. Then list the first six rows of data.

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/Ivan.txt"
Ivan = read.table(con, header = TRUE)
head(Ivan)
##   Year Mo Da Hr    Lon   Lat  Wind WindS Pmin Rmw   Hb Speed L Lhr
## 1 2004  9 15  8 -87.56 25.94 117.9 118.3  917  37 1.27 11.62 0 -24
## 2 2004  9 15  9 -87.65 26.12 117.7 117.3  917  37 1.27 12.05 0 -23
## 3 2004  9 15 10 -87.74 26.31 117.5 116.4  917  37 1.26 12.38 0 -22
## 4 2004  9 15 11 -87.82 26.50 117.2 115.5  918  37 1.26 12.60 0 -21
## 5 2004  9 15 12 -87.90 26.70 116.7 115.0  918  37 1.26 12.74 0 -20
## 6 2004  9 15 13 -87.97 26.90 116.2 114.8  919  37 1.26 12.78 0 -19

The data frame Ivan contains the latitude and longitude position of the hurricane every hour from 24 hours before landfall until 12 hours after landfall.

Here your geographic domain is the southeast, so first create a character vector of state names.

cs = c("texas", "louisiana", "mississippi", "alabama", "florida", "georgia", 
    "south carolina")

Next use map() to plot the state boundaries and fill the state polygons with a gray shade. Finally connect the hourly location points with the lines() function and add an arrow head.

map("state", region = cs, boundary = FALSE, col = "gray", fill = TRUE)
Lo = Ivan$Lon
La = Ivan$Lat
n = length(Lo)
lines(Lo, La, lwd = 2.5, col = "red")
arrows(Lo[n - 1], La[n - 1], Lo[n], La[n], lwd = 2.5, length = 0.1, col = "red")

[Figure: track of Hurricane Ivan (2004) approaching the U.S. Gulf coast]

Hurricane Ivan moved northward from the central Gulf of Mexico and made landfall in the western panhandle region of Florida before moving into southeastern Alabama.

The boundary data in the maps package are adequate for small-scale maps, but the number of boundary points is insufficient for close-up (high-resolution) views. Higher-resolution boundary data are available in the mapdata package.
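For example, assuming the mapdata package is installed, a higher-resolution view of the northeastern Gulf coast is drawn by typing

require(mapdata)
map("worldHires", xlim = c(-90, -82), ylim = c(27, 32))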

Spatial data types

Point data

Consider the set of events defined by the location at which a hurricane first reaches its lifetime maximum intensity. The data are available in the file LMI.txt and are input by typing

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-5/LMI.txt"
LMI.df = read.table(con, header = TRUE)
LMI.df$WmaxS = LMI.df$WmaxS * 0.5144
head(LMI.df[, c(4:10, 11)])
##               name   Yr Mo Da hr    lon   lat   Wmax
## 30861.5 DENNIS     1981  8 20 23 -70.82 36.97  70.45
## 30891.4 EMILY      1981  9  6 10 -58.14 40.63  80.60
## 30930.2 FLOYD      1981  9  7  2 -69.06 26.76 100.44
## 30972.2 GERT       1981  9 11 14 -71.66 29.40  90.46
## 31003.5 HARVEY     1981  9 14 23 -62.63 28.26 115.11
## 31054.4 IRENE      1981  9 28 16 -56.41 27.93 105.49

The Wmax column is a spline interpolated maximum wind speed and WmaxS is first smoothed then spline interpolated to allow time derivatives to be computed.

The raw wind speed values are given in 5 kt increments. Although knots (kt) are the operational unit used for reporting tropical cyclone intensity to the public in the United States, here you use the SI units of m/s.

We use the term 'intensity' as shorthand for 'maximum wind speed,' where maximum wind speed refers to the estimated fastest wind velocity somewhere in the core of the hurricane. Lifetime maximum refers to the highest maximum wind speed during the life of the hurricane.

You draw a map of the event locations with the plot() method using the longitude coordinate as the x variable and the latitude coordinate as the y variable by typing

with(LMI.df, plot(lon, lat, pch = 19))
map("world", col = "gray", add = TRUE)
grid()

[Figure: locations of lifetime maximum intensity events]

Adding country borders and latitude/longitude grid lines [grid() function] enhances the geographic information. The argument pch specifies a point character using an integer code. Here 19 refers to a solid circle (type ?points for more information). The with() function allows you to use the column names from the data frame in the plot() method.

Note the order of function calls. By plotting the events first, then adding the country borders, the borders are clipped to the plot window. The dimensions of the plot window default to slightly larger than the range of the longitude and latitude coordinates. The function chooses a reasonable number of axis tics placed along the range of coordinate values at reasonable intervals.

Since the events are marked by storm intensity it is informative to add this information to the map. Hurricane intensity, as indexed by an estimate of the wind speed maximum at 10 m height somewhere inside the hurricane, is a continuous variable. You can choose a set of discrete intensity intervals and group the events by these class intervals. For example, you might want to choose the Saffir/Simpson hurricane intensity scale.

To efficiently communicate differences in intensities with colors, you should limit the number of classes to six or fewer. The package classInt is a collection of functions for choosing class intervals. Here you require the package and create a vector of lifetime maxima. You then obtain class boundaries using the classIntervals() function.

Here the number of class intervals is set to five and the method of determining the interval breaks is based on Jenks optimization (style=“jenks”). Given the number of classes, the optimization minimizes the variance of the values within the intervals while maximizing the variance between the intervals.

require(classInt)
lmi = LMI.df$WmaxS
q5 = classIntervals(lmi, n = 5, style = "jenks", dataPrecision = 1)

The dataPrecision argument sets the number of digits to the right of the decimal place.

Next you choose a palette of colors. This is best left to someone with an understanding of hues and color separation schemes. The palettes described and printed in Brewer et al. (2003) for continuous, diverging, and categorical variables can be examined on maps at http://colorbrewer2.org/. Select the HEX radio button for a color palette of your choice and then copy and paste the hex code into a character vector preceded by the pound symbol.

For example, here you create a character vector (cls) of length 5 containing the hex codes from the color brewer website from a sequential color ramp ranging between yellow, orange, and red.

cls = c("#FFFFB2", "#FECC5C", "#FD8D3C", "#F03B20", "#BD0026")

To use your own set of colors simply modify this list. A character vector of color hex codes is generated automatically with functions in the colorRamps package.
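For example, assuming the colorRamps package is installed, the matlab.like() function returns a vector of hex color codes along a blue-to-red ramp.

require(colorRamps)
matlab.like(5)  # five colors from the ramp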

The empirical cumulative distribution function of cyclone intensities with the corresponding class intervals and colors is then plotted by typing

plot(q5, pal = cls, main = "", xlab = expression(paste("Wind Speed [m ", s^-1, 
    "]")), ylab = "Cumulative Frequency")

[Figure: cumulative distribution of lifetime maximum wind speeds with class intervals and colors]

The points (with horizontal dashes) are the lifetime maximum intensity wind speeds in rank order from lowest to highest.

Once you are satisfied with the class intervals and color palette, you can plot the events on a map. First you need to assign a color for each event depending on its wind speed value. This is done with the findColours() function as

q5c = findColours(q5, cls)

Now, instead of black dots with a color bar, each value is assigned a color corresponding to the class interval. For convenience you create the axis labels and save them as an expression object. You do this with the expression() and paste() functions to get the degree symbol.

xl = expression(paste("Longitude [", {
}^o, "E]"))
yl = expression(paste("Latitude [", {
}^o, "N]"))

Since the degree symbol is not attached to a character you use {} in front of the superscript symbol. You again use the plot() method on the location coordinates, but this time set the color argument to the corresponding vector of colors saved in q5c.

plot(LMI.df$lon, LMI.df$lat, xlab = xl, ylab = yl, col = q5c, pch = 19)
points(LMI.df$lon, LMI.df$lat)
map("world", add = TRUE)
axis(3)
axis(4)
grid()
legend("bottomright", bg = "white", fill = attr(q5c, "palette"), legend = names(attr(q5c, 
    "table")), title = expression(paste("Wind Speed [m ", s^-1, "]")))