Hurricane Climate: Week 3

Last week you learned how to use R to (1) compute summary statistics for data on hurricane activity, (2) generate random numbers as a way to examine likelihood models, (3) work with probability distributions. Some questions regarding hurricane climate variability and change were discussed. I also rendered an opinion about data reliability through the ages.

Today: Continue with classical statistics: statistical tests, correlation, and regression.

The $ p $-value is an estimate of the probability that a particular result, or a result more extreme than the result observed, could have occurred by chance if the null hypothesis is true. In the present case, if the true mean is $ \mu_0 $ what is the probability that your sample mean is as far or farther from $ \mu_0 $ as it is? In short, the $ p $-value is a measure of the credibility of the null hypothesis. The higher the $ p $-value the more credible the null hypothesis appears given your sample of data.

But, the $ p $-value is best interpreted as evidence against the null hypothesis, thus a small value indicates evidence to reject the null. The interpretation is not black and white. A convenient way to express the evidence is given in this table.

The $ p $-value as evidence against the null hypothesis.

$ p $-value	evidence
0–0.01	convincing
0.01–0.05	moderate
0.05–0.15	suggestive, but inconclusive
$ > $ 0.15	no

Consider the area-averaged North Atlantic sea-surface temperature (SST) values each August © as an example. Input the monthly data and save the values for August in a separate vector by typing

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/SST.txt"
SST = read.table(con, header = TRUE)
sst = SST$Aug

Begin with a summary table.

summary(sst)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    22.6    23.0    23.1    23.1    23.3    23.9       5

You might be interested in the hypothesis that the values deviate significantly from 23.2C. The task for you is to test whether this distribution has a mean $ \mu=23.2 $. Assuming the normal distribution is an adequate model for your data, this is done using the t.test() function as follows.

t.test(sst, mu = 23.2)

## 
##  One Sample t-test
## 
## data:  sst 
## t = -2.942, df = 154, p-value = 0.003759
## alternative hypothesis: true mean is not equal to 23.2 
## 95 percent confidence interval:
##  23.10 23.18 
## sample estimates:
## mean of x 
##     23.14

The output begins with a description of the test followed by the name of the data object used. The next line contains the value of the $ t $-statistic, degrees of freedom, and the $ p $-value.

The df (degrees of freedom) is the number of values in calculating the $ t $-statistic that are free to vary. Here it is equal to the number of years (number of independent pieces of information) that go into calculating the $ t $-statistic minus the number of statistics used in the intermediate steps of the calculation.

The $ p $-value is 0.004 indicating conclusive evidence against the null hypothesis that the mean is 23.2. Said another way, if the true mean is 23.2, then the sample of data we have is very unusual.

A sentence regarding the alternative hypothesis is printed next. It has two pieces of information; the value corresponding to your hypothesis and whether your test is one- or two-sided. Here it states 'not equal to', indicating a two-sided test.

You specify a one-sided test against the alternative of a larger $ \mu $ by using the alternative=“greater” argument. For instance you might hypothesize the temperature to exceed a certain threshold value. Abbreviated argument names often work. For example, here it is okay to write alt=“g” to get the one-sided, greater than, alternative.

The next output is the 95% confidence interval for the true mean. You can think of it as defining a set of hypothetical mean values, such that if they were used as values for your null hypothesis (instead of 23.2) they would lead to a $ p $-value of 0.05 or greater (failure to reject the null). You can specify a different confidence level with the conf.level argument. For example, conf.level=.99 will give you a 99% interval.

The final bit of output is the mean value from your sample. It is the best single estimate for the true mean. Note that you are not testing the data. You are testing a hypothesis about some hypothetical value using your data.

In classical inference you state a hypothesis that, for example, the population mean has a value equal to $ \mu_0 $. You then use your data to see if there is evidence to reject it. The evidence is summarized as a $ p $-value. The hypothesis is a model (normal distribution centered on mu).

A $ p $-value less than 0.15 is taken as suggestive, but inconclusive evidence that the hypothesis is wrong, while a $ p $-value less than 0.01 is convincing evidence it is wrong.

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test (Mann-Whitney U-test) is a non-parametric alternative to the $ t $ test. The hypothesized mean ($ \mu_0 $) is subtracted from each observation. Ranks are assigned to each difference with the smallest difference given a rank of one. Then, the set of absolute magnitudes of each of the differences are ranked.

The function rank() is used to obtain ranks from a set of values. For example, type

rank(c(2.1, 5.3, 1.7, 1.9))

## [1] 3 4 1 2

The function returns the ranks with each value assigned a ranking from lowest to highest. Here the value of 2.1 in the first position of the data vector is ranked third and the value of 1.7 in the fourth position is ranked one.

The ranks of the first 18 differences between the observed and null hypothesis value are

x = sst - 23.2
r = rank(abs(x))
r[1:18]

##  [1] 156.0 157.0 158.0 159.0 160.0  34.0 112.0 124.0  91.0 102.0  11.0
## [12] 147.0 150.0  37.0  96.0   3.5  77.0  20.0

This says there are 156 years that have a difference (absolute value) less than or equal to the first year's SST value. By default ties are handled by averaging the ranks so for an even number of ties the rank are expressed as a fractional half, otherwise they are a whole number.

The test statistic V is the sum of the ranks corresponding to the values that are above the hypothesized mean

sum(r[x > 0], na.rm = TRUE)

## [1] 4224

Assuming only that the distribution is symmetric around $ \mu_0 $, the test statistic corresponds to selecting each rank from 1 to $ n $ with probability of .5 and calculating the sum. The distribution of the test statistic can be calculated exactly. For large samples the distribution is approximately normal.

The application of the Wilcoxon test is done in the same way as the application of the $ t $ test. You specify the data values in the first argument and the hypothesized population mean in the second argument.

wilcox.test(sst, mu = 23.2)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  sst 
## V = 4170, p-value = 0.001189
## alternative hypothesis: true location is not equal to 23.2

The $ p $-value indicates convincing evidence against the null hypothesis.

If the assumptions are met, then the $ t $ test will be more efficient by about 5% relative to the Wilcoxon test. That is, for a given sample size, the $ t $ test better maximizes the probability that the test will reject the null hypothesis when it is false. That is, the $ t $ test has more power than the Wilcoxon test.

The Wilcoxon test has problems when there are ties in the ranks for small samples. By default (if exact is not specified), an exact $ p $-value is computed if the samples contain less than 50 values and there are no ties. Otherwise a normal approximation is used.

Two-Sample Test

The two-sample $ t $ test is used to test the hypothesis that two samples come from distributions with the same population mean. The theory is the same as that employed in the one-sample test. Vectors are now doubly indexed ($ x_{1,1}, \ldots, x_{1,n_1} $ and $ x_{2,1}, \ldots, x_{2,n_2} $). The first index identifies the sample and the second identifies the case.

The assumption is that normal distributions $ N(\mu_1, \sigma_1^2) $ and $ N(\mu_2, \sigma_2^2) $ adequately describe your values and your interest is to test the null hypothesis $ \mu_1 = \mu_2 $. You calculate the $ t $-statistic as \[ t = \frac{\bar x_1 - \bar x_2}{\sqrt{\hbox{s.e.}(\bar x_1)^2+\hbox{s.e.}(\bar x_2)^2}} \] where the denominator is the standard error of the difference in means and $ \hbox{s.e.}(\bar x_i) = s_i/\sqrt{n_i} $. If you assume the two samples have the same variance ($ s_1^2 = s_2^2 $) then you calculate the s.e.($ \bar x $)s using a single value for $ s $ based on the standard deviation of the values over both samples. Under the null hypothesis that the population means are the same, the $ t $-statistic will follow a $ t $-distribution with $ n_1+n_2-2 $ degrees of freedom.

If you don't assume equal variance the $ t $-statistic is approximated by a $ t $-distribution after adjusting the degrees of freedom by the Welch procedure. By default the t.test() function uses the Welch procedure resulting in a non-integer degrees of freedom. Regardless of the adjustment, the two-sample test will usually give about the same result unless the sample sizes and the standard deviations are quite different.

As an example, suppose you are interested in whether the June NAO values have mean values that are different depending on hurricane activity along the Gulf coast later in the year.

con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/H.txt"
H = read.table(con, header = TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/NAO.txt"
NAO = read.table(con, header = TRUE)
nao = NAO$Jun

First create two samples of NAO values. The first sample contains June NAO values in years with no Gulf hurricanes and the second sample contains NAO values in years with at least two Gulf hurricanes. This is done using the subset() function.

nao.s1 = subset(nao, H$G == 0)
nao.s2 = subset(nao, H$G >= 2)

Summarize the values with the mean, standard deviation, and sample size.

mean(nao.s1)

## [1] -0.277

sd(nao.s1)

## [1] 1.51

length(nao.s1)

## [1] 80

mean(nao.s2)

## [1] -1.057

sd(nao.s2)

## [1] 1.03

length(nao.s2)

## [1] 22

The mean NAO value is larger during inactive hurricane years but is it significantly larger? The standard deviation is also larger and so is the sample size.

Your null hypothesis is that the population mean of June NAO values during active Gulf years is equal to the population mean of June NAO values during inactive years. The test is performed with the t.test() function with the data vectors as the two arguments. By default it uses the Welch procedure.

t.test(nao.s1, nao.s2)

## 
##  Welch Two Sample t-test
## 
## data:  nao.s1 and nao.s2 
## t = 2.815, df = 48.67, p-value = 0.007015
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  0.2231 1.3365 
## sample estimates:
## mean of x mean of y 
##    -0.277    -1.057

The output is similar as before. Assuming the null hypothesis of no difference in population means is correct, this $ t $ value (or a value larger in magnitude) has a small probability of occurring by chance providing compelling evidence that June NAO values are different between the two samples.

As with the one-sample test the alternative hypothesis, which is that the true difference in means is not equal to zero, is stated as part of the output. This is the most common alternative in these situations.

The confidence interval (CI) refers to the difference in sample means (mean from sample 1 minus mean from sample 2). If you are willing to assume the variances are equal (for example both samples come from the same population), you can specify the argument var.equal=TRUE. In this case the number of degrees of freedom is a whole number, the $ p $-value is larger, and the confidence interval wider.

Statistical Formula

While you might consider the data in separate vectors, this is not the best way to do things in R. Instead of creating subsets of the object nao based on values in the object H, you create a data frame with two parallel columns. Include all values for the NAO in one column and the result of a logical operation on Gulf hurricane activity in a separate column.

gulf = H$G > 1
nao.df = data.frame(nao, gulf)
tail(nao.df)

##       nao  gulf
## 155 -1.00  TRUE
## 156 -0.41 FALSE
## 157 -3.34 FALSE
## 158 -2.05  TRUE
## 159 -3.05 FALSE
## 160 -2.40 FALSE

This shows the NAO values and whether or not there was two or more Gulf hurricanes in corresponding years.

The goal is to see whether there is a shift in level of the NAO between the two groups of hurricane activity years (TRUE and FALSE). Here the groups are years with two or more Gulf hurricanes (TRUE) and years with one or fewer hurricanes (FALSE).

With this setup you specify a two-sample $ t $-test using the tilde (~) operator as

t.test(nao ~ gulf, data = nao.df)

## 
##  Welch Two Sample t-test
## 
## data:  nao by gulf 
## t = 3.102, df = 35.97, p-value = 0.00373
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  0.2707 1.2935 
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##             -0.2747             -1.0568

The object to the left of the tilde is the variable you want to test and the object to the right is the variable used for testing. The tilde is read as 'described by' or 'conditioned on.' That is, the June NAO values are described by Gulf coast hurricane activity. This is how statistical models are specified in R. You will see this model structure throughout the book. Note that by using the data=nao.df you can use the column vectors in the data frame by name in the model formula.

The conclusion is the same. Years of high and low Gulf hurricane activity appear to be presaged by June NAO values that are significantly different. The output is essentially the same although the group names are taken from the output of the logical operation. Here FALSE refers to inactive hurricane years.

Two-Sample Wilcoxon Test

As with the one-sample Wilcoxon test, the two-sample counterpart is based on replacing your data values by their corresponding rank. This is done without regard to group. The test statistic $ W $ is then computed as the sum of the ranks in one group.

The function is applied using the model structure as

wilcox.test(nao ~ gulf, data = nao.df)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  nao by gulf 
## W = 2064, p-value = 0.006874
## alternative hypothesis: true location shift is not equal to 0

The results are similar to those found using the $ t $-test and are interpreted as convincing evidence of a relationship between late spring NAO index values and hurricane activity along the Gulf coast of the United States.

Compare Variances

It's not necessary to assume equal variances when testing for differences in means. Indeed this is the default option with t.test(). Yet your interest could be whether the variability is changing. For instance you might speculate that the variability in hurricane activity will increase with global warming.

Note that the variance is strictly positive. Given two samples of data, the ratio of variances will be unity if the variances are equal. Under the assumption of equal population variance the $ F $-statistic, as the ratio of the sample variances, has an $ F $-distribution with two parameters. The parameters are the two samples sizes minus one.

The $ F $-distribution is positively skewed meaning the tail on the right is longer than the tail on the left. You can plot this by typing

par(las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
leg = c("F(80, 120)", "F(12, 8)")
curve(df(x, df1 = 80, df2 = 120), from = 0, to = 3, lwd = 2, ylab = "Probability density")
curve(df(x, df1 = 12, df2 = 8), from = -3, to = 3, add = TRUE, col = 2, lwd = 2)
legend("topright", legend = leg, col = c(1, 2), lwd = c(2, 2))

plot of chunk Fdistribution

A larger sample size results in a density centered on one and more symmetric.

The function var.test() is called in the same way as the function t.test(), but performs an $ F $ test on the ratio of the group variances.

var.test(nao ~ gulf, data = nao.df)

## 
##  F test to compare two variances
## 
## data:  nao by gulf 
## F = 2, num df = 137, denom df = 21, p-value = 0.06711
## alternative hypothesis: true ratio of variances is not equal to 1 
## 95 percent confidence interval:
##  0.9499 3.5839 
## sample estimates:
## ratio of variances 
##                  2

The magnitude of the $ p $-value provides suggestive but inconclusive evidence of a difference in population variance. Note that the 95% confidence interval on the $ F $-statistic includes the value of one as you would expect given the $ p $-value.

The $ F $-test is sensitive to departures from normality. Also, for small data sets the confidence interval will be quite wide often requiring you to take the assumption of equal variance as a matter of belief.

Pearson's Product-Moment Correlation

Correlation extends the idea of comparing one variable in relation to another. Correlation indicates the amount and the direction of association between two variables. If hurricanes occur more often when the ocean is warmer, then you say that ocean temperature is positively correlated with hurricane incidence; as one goes up, the other goes up.

Pearson's product-moment correlation coefficient is derived from the bivariate normal distribution of two variables, where the theoretical correlation describes contour ellipses about the two-dimensional densities. It is the workhorse of climatological studies. If both variables are scaled to have unit variance, then a correlation of zero corresponds to circular contours and a correlation of one corresponds to a line segment.

The figure below shows two examples one where the variables $ x $ and $ y $ have a small positive correlation and the other where they have a fairly large negative correlation. The points are generated from a sample of bivariate normal values with a Pearson product-moment correlation of 0.2 and $-$0.7. The contours enclose the 75 and 95% probability region for a bivariate normal distribution with mean of zero, unit variances, and corresponding correlations. You create this figure by typing

require(MASS)

## Loading required package: MASS

require(ellipse)

## Loading required package: ellipse

## Warning: there is no package called 'ellipse'

par(mfrow = c(1, 2), pty = "s", las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
plot(mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, 0.2, 0.2, 1), 2, 2)), ylim = c(-4, 
    4), xlim = c(-4, 4), pch = 1, xlab = "x", ylab = "y")
lines(ellipse(0.2, level = 0.75), col = "red", lwd = 2)

## Error: could not find function "ellipse"

lines(ellipse(0.2, level = 0.95), col = "red", lwd = 2)

## Error: could not find function "ellipse"

mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
plot(mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, -0.7, -0.7, 1), 2, 2)), 
    ylim = c(-4, 4), xlim = c(-4, 4), pch = 1, xlab = "x", ylab = "y")
lines(ellipse(-0.7, level = 0.75), col = "red", lwd = 2)

## Error: could not find function "ellipse"

lines(ellipse(-0.7, level = 0.95), col = "red", lwd = 2)

## Error: could not find function "ellipse"

mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

plot of chunk correlationscatterplot

The Pearson correlation coefficient between $ x $ and $ y $ is \[ r(x,y) = \frac{\sum (x_i - \bar x) (y_i - \bar y)}{\sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2}} \] The Pearson correlation is often called the 'linear' correlation since the absolute value of $ r $ will be one when there is a perfect linear relationship between $ x_i $ and $ y_i $.

The function cor() is used to compute the correlation between two or more vectors. For example to get the linear correlation between the May and June values of the NAO, type

cor(NAO$May, NAO$Jun)

## [1] 0.03682

The value indicates weak positive correlation. Note that the order of the vectors in the function is irrelevant as $ r(x,y)=r(y,x) $.

If there are missing values in $ x $ or $ y $ type

cor(SST$Aug, SST$Sep, use = "complete.obs")

## [1] 0.944

Here the correlation value indicates strong positive correlation.

This value of $ r $ estimated from the data is a random variable and is thus subject to sampling variation. For instance, adding another year's worth of data will result in a value for $ r $ that is somewhat different. Typically your hypothesis is that the population correlation is zero. As might be guessed from the differences in $ r $ your conclusions about this hypothesis will likely be different for the SST and NAO data.

You can ask the question differently. For example, in 1000 samples of $ x $ and $ y $ each of size 30 from a population with zero correlation, what is the largest value of $ r $? You answer this question using simulations by typing

set.seed(3042)
n = 30
cc = numeric()
for (i in 1:1000) {
    x = rnorm(n)
    y = rnorm(n)
    cc[i] = cor(x, y)
}
mean(cc)

## [1] -0.0148

max(cc)

## [1] 0.5688

The variable n sets the sample size and you simulate 1000 different correlation coefficients from different independent samples of x and y. The average correlation is close to zero as expected but the maximum correlation is quite large.

High correlation can arise by chance. Thus when you report a correlation coefficient a confidence interval on your estimate or a test of significance should be included. This is done with the cor.test() function. The test is based on transforming $ r $ to a statistic that has a $ t $-distribution.

Returning to the NAO example, to obtain a confidence interval on the correlation between the May and June values of the NAO and a test of significance, type

cor.test(NAO$May, NAO$Jun)

## 
##  Pearson's product-moment correlation
## 
## data:  NAO$May and NAO$Jun 
## t = 0.4632, df = 158, p-value = 0.6439
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  -0.1190  0.1909 
## sample estimates:
##     cor 
## 0.03682

The output also gives a $ p $-value as evidence in support of the null hypothesis of no correlation. It also gives a CI for the correlation estimate.

Repeat this example using the January and September values of SST. What is the confidence interval on the correlation estimate? How would you describe the evidence against the null hypothesis of zero correlation in this case?

Spearman's Rank and Kendall's $ \tau $ Correlation

An alternative is Spearman's rank ($ \rho $) correlation, which overcomes the effect of outliers and skewness by considering the rank of the data rather than the magnitude. The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables.

The Pearson correlation is the default in the cor.test() function. You change this with the method argument. To obtain Spearman's rank correlation and the associated test of significance, type

cor.test(H$G, H$FL, method = "spearman")

## Warning: Cannot compute exact p-values with ties

## 
##  Spearman's rank correlation rho
## 
## data:  H$G and H$FL 
## S = 551867, p-value = 0.01524
## alternative hypothesis: true rho is not equal to 0 
## sample estimates:
##    rho 
## 0.1916

The results provide suggestive evidence against the null hypothesis of zero correlation.

Another correlation is Kendall's $ \tau $, which is based on counting the number of concordant and discordant point pairs from your data. For two data vectors $ x $ and $ y $ each of length $ n $, a point at location $ i $ is given in two-dimensional space as ($ x_i, y_i $). A point pair is defined as [$ (x_i, y_i); (x_j, y_j) $] for $ i \neq j $.

A point pair is concordant if the difference in the $ x $ values is of the same sign as the difference in the $ y $ values, otherwise it is discordant. The value of Kendall's $ \tau $ is the number of concordant pairs minus the number of discordant pairs divided by the total number of unique point pairs, which is $ n(n-1)/2 $ where $ n $ is the sample size. For a perfect correlation, either all point pairs are concordant or all pairs are discordant. Under zero correlation there are as many concordant pairs as discordant pairs.

Repeat the call to function cor.test() on coastal hurricane activity, but now use the kendall method.

cor.test(H$G, H$FL, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  H$G and H$FL 
## z = 2.442, p-value = 0.01461
## alternative hypothesis: true tau is not equal to 0 
## sample estimates:
##    tau 
## 0.1744

Bootstrap Confidence Intervals

Kendall's tau and Spearman's rank correlations do not come with confidence intervals. You should always report a confidence interval. In this case you use a procedure called 'bootstrapping', which is a resampling technique. The idea is to sample the values from your data with replacement using the sample() function. The sample size is the same as the size of your original data. The sample is called a bootstrap replicate. You then compute the statistic of interest from your replicate. The bootstrap value will be different than the value computed from your data because the replicate contains repeats and not all data are included. You repeat the procedure many times collecting all the bootstrap values. You then use the quantile() function to determine the lower and upper quantiles corresponding to the 0.025 and 0.975 probabilities.

The function boot() from the package boot generates bootstrap replicates of any statistic applied to your data. It has options for parametric and non-parametric resampling. To implement bootrapping for Kendall's tau you first create a function as follows.

mybootfun = function(x, i) {
    Gbs = x$G[i]
    Fbs = x$FL[i]
    return(cor.test(Gbs, Fbs, method = "k")$est)
}

Your function has two variables; the data x and an index variable i. Next you generate 1000 bootstrap samples and calculate the confidence intervals by typing

tau.bs = boot(data = H, statistic = mybootfun, R = 1000)

## Error: could not find function "boot"

boot.ci(tau.bs, conf = 0.95)

## Error: could not find function "boot.ci"

The boot() function must be run prior to running the boot.ci() function. The result is a 95% CI about the estimated $ \tau $.

Causation

A significant correlation between ocean warmth and hurricane activity does not prove causality. Moreover since the association is symmetric, it does not say that $ x $ causes $ y $ any more than it says $ y $ causes $ x $. This is why you frequently hear 'correlation does not equal causation.' The problem with this adage is that it ignores the fact that correlation is needed for causation. It is necessary, but insufficient. When correlation is properly interpreted it is indispensable in the study of hurricanes and climate.

Your correlation results are more meaningful if you explain how the variables are physically related. Several different studies showing a consistent correlation between two variables using different time and space scales, and over different time periods and different regions, provide greater evidence of an association than a single study. However, if you want proof that a single factor causes hurricanes, then correlation is not enough.

Linear Regression

Correlation is the most widely used statistic in climatology, but linear regression is arguably the most important statistical model. When you say that variable $ x $ has a linear relationship to variable $ y $ you mean $y= a + b x $, where $ a $ is $ y $-intercept and $ b $ is the slope of the line. You call $ x $ the independent variable and $ y $ the dependent variable because the value of $ y $ depends on the value of $ x $.

But in statistics you don't assume these variables have a perfect linear relationship. Instead, in describing the relationship between two random vectors $ x_i $ and $ y_i $, you add an error term ($ \varepsilon $) to the equation such that \[ y_i = \alpha + \beta x_i + \varepsilon_i \] You assume the values $ \varepsilon_i $ are iid $ N(0, \sigma^2) $.

The slope of the line is the regression coefficient $ \beta $, which is the increase in the average value of $ y $ per unit change in $ x $. The line intersects the $ y $-axis at the intercept $ \alpha $. The vector $ x $ is called the explanatory variable and the vector $ y $ is called the response variable.

The equation describes a regression of the variable $ y $ onto the variable $ x $. This is always the case. You regress your response variable onto your explanatory variable(s), where the word 'regression' refers to a model for the mean of the response variable.

The model consists of three parameters $ \alpha $, $ \beta $, and $ \sigma^2 $. For a set of explanatory and response values, the parameters are estimated using the method of least squares. The method finds a set of $ \alpha $ and $ \beta $ values that minimize the sum of squared residuals given as \[ \hbox{SS}_{res} = \sum_i [y_i - (\alpha+\beta x_i)]^2 \]

The solution to this minimization is a set of equations given by \[ \hat \beta = \frac{\sum(x_i - \bar x) (y_i - \bar y)}{\sum (x_i - \bar x)^2} \\ \hat \alpha = \bar y - \hat \beta \bar x \] that define estimates for $ \alpha $ and $ \beta $. The residual variance is $ \hbox{SS}_{res} / (n-2) $, where $ \hat \alpha $ and $ \hat \beta $ are used in Eq.~\ref{eq:sse}.

The regression line is written as \[ \hat y_i = \hat \alpha + \hat \beta x_i \] The method of least squares to determine the $ \hat \alpha $ and $ \hat \beta $ values is performed by the function lm() (linear model). If you are interested in regressing the August values of Atlantic SST onto the preceding January SST values, you type

lm(SST$Aug ~ SST$Jan)

## 
## Call:
## lm(formula = SST$Aug ~ SST$Jan)
## 
## Coefficients:
## (Intercept)      SST$Jan  
##        7.11         0.84

The argument to lm is a model formula where the tilde symbol as we've seen is read as 'described by.' Or to be more complete in this particular case, we state that the August SST values are described by the January SST values using a linear regression model. In this case you have a single explanatory variable so the formula is simply y~x and you call the model 'simple linear regression'.

The response variable is the variable you are interested in modeling. You must decide which variable is your response variable before beginning. Unlike correlation, a regression of $ y $ onto $ x $ is not the same as a regression of $ x $ onto $ y $. Your choice depends on the question you want answered. You get no guidance by examining your data nor will R tell you. Here you choose August SST as the response since it is natural to consider using information from an earlier month to predict what might happen in a later month.

Parameter estimates are given in the table of coefficients. The estimated intercept value ($ \hat \alpha $) is given under Intercept and the estimated slope value ($ \hat \beta $) under SST$Jan. The units on the intercept parameter are the same as the units of the response variable, here C. The units on the slope parameter are the units of the response divided by the units of the explanatory variable, here C per C. Thus you interpret the slope value in this example as follows: for every 1C increase in January SST, the August SST increases by 0.84C on average.

The slope and intercept values will deviate somewhat from the true values due to sampling variation. One way to examine how much deviation is to take many samples from the data and, with each sample, use the lm() function to determine the parameter. The code below does this for the slope parameter using January SST values as the explanatory variable and August SST values as the response.

sl = numeric()
for (i in 1:1000) {
    id = sample(1:length(SST$Aug), replace = TRUE)
    sl[i] = lm(SST$Aug[id] ~ SST$Jan[id])$coef[2]
}
round(quantile(sl), digits = 2)

##   0%  25%  50%  75% 100% 
## 0.59 0.79 0.84 0.89 1.06

You sample from the set of row indices and use the same index for the January and the August values. Results indicate that 50% of the slopes fall between the values 0.79 and 0.89C per C.

Although illustrative, sampling is not needed. Recall you calculated the s.e.($ \bar x $) from a single sample to describe the variability of the sample mean. You do the same to calculate the standard error of the slope (and intercept) from a sample of $ x $ and $ y $ values. These standard errors, denoted s.e.($ \hat \beta $) and s.e.($ \hat \alpha $), are used for inference and to compute confidence intervals.

Typically the key inference is a test of the null hypothesis that the population value for the slope is zero. This implies the line is horizontal and there is no relationship between the response and the explanatory variable. The test statistic in this case is \[ t = \frac{\hat \beta}{\hbox{s.e.}(\hat \beta)} \] which follows a $ t $ distribution with $ n-2 $ degrees of freedom if the true slope is zero. Similarly you can test the null hypothesis that the intercept is zero, but this often has little physical meaning because it typically involves an extrapolation outside the range of your $ x $ values.

The value for the test statistic ($ t $ value) is not provided as part of the raw output from lm. The result is a model object. This is a key concept. In Chapter~\ref{chap:Rtutorial} you encountered data objects. You created structured data vectors and input data frames from a spreadsheet. The saved objects are listed in your working session by typing objects(). Functions, like table(), are applied to these objects to extract information.

In the same way, a model object contains a lot more information than what is printed. An important extractor function is summary(). You saw previously that applied to a data frame object, the function extracts statistics about the values in each column. When applied to a model object it extracts information about the model.

For instance, to obtain the statistics from the regression model of August SST onto January SST, type

summary(lm(SST$Aug ~ SST$Jan))

## 
## Call:
## lm(formula = SST$Aug ~ SST$Jan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4773 -0.1390 -0.0089  0.1401  0.4928 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.1117     1.4597    4.87  2.7e-06 ***
## SST$Jan       0.8400     0.0765   10.98  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.197 on 153 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared: 0.441,   Adjusted R-squared: 0.437 
## F-statistic:  121 on 1 and 153 DF,  p-value: <2e-16

The output is grouped by the call function, summary of residuals, table of coefficients, residual standard error, $ R^2 $ values, and the $ F $-statistic and associated $ p $-value. The output starts with a repeat of the function call. This is useful if the object is saved and examined later.

Summary statistics on the set of model residuals follow. A residual is the difference between the response value at a particular explanatory value and the modeled value. The distribution of residuals is an important diagnostic in evaluating how well the model fits your data. The average of the residuals is zero by definition of least squares, so the median should not be far from zero and the minimum and maximum should roughly be equal in absolute magnitude. Similar with the first 1Q and third 3Q) quartile values.

Next, the table of coefficients shows the intercept and slope values as in the raw output, but here the accompanying standard errors, $ t $ values, and $ p $-values are also provided. The output is in tabular form with the each row corresponding to a separate parameter. The first row is the intercept parameter and the second row is the slope parameter associated with the explanatory variable. Don't be confused about this labeling. The vector SST$Jan is the explanatory variable and the coefficient value is the amount of change in the mean response associated with a unit change in the explanatory variable.

The first labeled column is the sample estimate (Estimate), the second is the standard error (Std. Error), the third is the $ t $ value (t value) and the fourth is the $ p $-value (Pr(>|t|))). Note that the $ t $ value is the ratio of the estimated value to its standard error. The $ p $-value is the probability of finding a $ t $-value as large or larger (in absolute value) by chance assuming the estimate is zero. Here the $ p $-value on the January SST coefficient is near zero indicating no relationship between January SST and August SST given your sample of data.

By default, symbols are placed to the right of the $ p $-values as indictors of the level of significance and a line below the table provides the definition. Here we turned them off using an argument in the options() function. In your scientific reporting the $ p $-value itself should always be reported rather than a categorical significance level.

The interpretation of a $ p $-value as evidence in support of the null hypothesis is the same as you encountered earlier. Your job is simply to determine the null hypothesis. In the context of regression, the assumption of no relationship between the explanatory and response variable, is typically your null hypothesis. Therefore, a low $ p $-value indicates evidence of a relationship between your explanatory and response variables.

Continuing with the extracted summary of the model object, the residual standard error quantifies the variation of the observed values about the regression line. It is computed as the square root of the sum of the squared residuals divided by the square root of the degrees of freedom. The degrees of freedom is the sample size minus the number of coefficients. It provides an estimate of the model parameter $ \sigma $.

Next are the R-squared values. The 'multiple R squared,' is the proportion of variation in the response variable that can be explained by the explanatory variable. So here you state that the model explains a percentage of the variation in August SST values. With only a single explanatory variable (simple linear regression) the multiple R squared is equal to the square of the Pearson correlation coefficient.

The adjusted R squared ($ \bar R^2 $) is a modification to the multiple R squared for the number of explanatory variables. The adjusted R squared increases only if the new variable improves the model more than would be expected by chance. It can be negative, and will always be less than or equal to the multiple R squared. It is defined as \[ \bar R^2 = 1-(1-R^{2}){n-1 \over n-p-1} \] where $ n $ is the sample size and $ p $ is the number of explanatory variables. In small samples with many explanatory variables, the difference between $ R^2 $ and $ \bar R^2 $ will be large.

The final bit of output is related to an $ F $ test, which is a test concerning the entire model. The output includes the $ F $ statistic, the degrees of freedom (in this case two of them) and the corresponding $ p $-value as evidence in support of the null hypothesis that the model has no explanatory power. In the case of simple regression, it is equivalent to the test on the slope parameter so it is only interesting when there is more than one explanatory variable. Note that the $ F $ statistic is equal to the square of the $ t $ statistic, which is true of any linear regression model with one explanatory variable.

The function resid() takes a model object and extracts the vector of residual values. For example, type

lrm = lm(Aug ~ Jan, data = SST)
resid(lrm)[1:10]

##        6        7        8        9       10       11       12       13 
## -0.06288 -0.26904  0.02784  0.08924 -0.14260  0.21648 -0.32968 -0.26140 
##       14       15 
##  0.43564 -0.20316

First the model object is saved with name lrm. Here only the column names are referenced in the model formula because you specify the data frame with the data argument. Then the extractor function resid() lists the residuals. Here using the subset function you list only the first ten residuals.

Similarly the function fitted() computes the mean response value for each value of the explanatory variable. For example, type

fitted(lrm)[1:10]

##     6     7     8     9    10    11    12    13    14    15 
## 23.19 23.19 22.82 22.90 23.11 23.01 22.99 22.89 22.85 23.18

These fitted values lie along the regression line.

Note that the residuals and fitted values are labeled with the row numbers of the SST data frame. In particular, note that they do not contain rows 1 through 5, which are missing in the response and explanatory variable columns.

A useful application for a statistical model is predicting for new values of the explanatory variable. The predict() function is similar to the fitted() function, but allows you to predict values of the response for arbitrary values of the explanatory variable. The caveat is that you need to specify the explanatory values as a data frame using the newdata argument. For example, to make a SST prediction for August given a value of 19.4C in January, type

predict(lrm, newdata = data.frame(Jan = 19.4))

##     1 
## 23.41

The word 'predictor' is the generic term for an explanatory variable in a statistical model. A further distinction is sometimes made between covariates, which are continuous-valued predictors and factors, which can take on only a few values that may or may not be ordered.

A prediction is not worth much without an estimate of uncertainty. A predicted value from a statistical model has two sources of uncertainty. One is the uncertainty about the mean of the response conditional on the value of the explanatory variable. Like the standard error of the mean, it is the precision with which the conditional mean is known. It is known as a confidence interval. To obtain the confidence interval on the predicted value, type

predict(lrm, data.frame(Jan = 19.4), int = "c")

##     fit   lwr   upr
## 1 23.41 23.35 23.47

The argument int=“c” tells the extractor function predict() to provide an confidence interval on the predicted value. The output includes the predicted value in the column labeled fit() and the lower and upper confidence limits in the columns lwr and upr, respectively. By default the limits define the 95% CI. This can be changed with the level argument.

Given the data and the model there is a 95% chance that the interval defined by the limits will cover the true (population) mean when the January SST value is 19.4C. The other source of uncertainty arises from the distribution of a particular value given the conditional mean. That is, even if you know the conditional mean exactly, the distribution of particular values about the mean will have a spread.

The prediction interval provides a bound on a set of new values from the model that contains both sources of uncertainty. As a consequence, for a given confidence level, the prediction interval will always be wider than the confidence interval. The prediction interval relies on the assumption of normally distributed errors with a constant variance across the values of the explanatory variable.

To obtain the prediction interval on the predicted value, type

predict(lrm, data.frame(Jan = 19.4), int = "p")

##     fit   lwr  upr
## 1 23.41 23.01 23.8

Given the data and the model there is a 95% chance that the interval defined by these limits will cover any future value of August SST given that the January SST value is 19.4C.

Multiple Linear Regression

Multiple regression extends simple linear regression by allowing more than one explanatory variable. Everything from simple regression carries over. Each additional explanatory variable contributes a new term to the model. However, an issue now arises because of possible relationships between the explanatory variables.

As an illustration, we continue with a model for predicting August SST values over the North Atlantic using SST values from earlier months. Specifically for this example you are interested in making predictions with the model at the end of March. You have January, February, and March SST values plus Year as the set of explanatory variables.

The first step is to plot your response and explanatory variables. This is done with the pairs() function. By including the panel.smooth() function as an argument to panel a local smoother is used on the set of points that allows you to more easily see possible relationships. Here you specify the August values (column 9 in SST) to be plotted in the first row (and column) followed by year and then the January through March values.

labs = c("Aug SST [C]", "Year", "Jan SST [C]", "Feb SST [C]", "Mar SST [C]")
pairs(SST[, c(9, 1:4)], panel = panel.smooth, labels = labs)

plot of chunk pairsFunction

The scatter plots are arranged in a two-dimensional matrix. The response variable is August SST and the four explanatory variables include Year, and the SST values during January, February, and March. A locally-weighted polynomial smoother with a span of 67% of the points is used to draw the red lines.

The diagonal elements of the matrix are the variable labels. The plot in the first row and second column is the August SST values on the vertical axis and the year on the horizontal axis. The plot in row one column three is the August SST values on the vertical axis and January SST values on the horizontal axis, and so on.

The plots draw your attention to what explanatory variables might be important in a model. Here all relationships are positive. Specifically, August SST values increase with increasing year and increasing January through March SST values. Based on these bivariate relationships you might expect that all four explanatory variables, with the exception of perhaps Year, will be important in the model of August SST.

The plots also show the relationships between the covariates. Here you see a tight linear relationship between each month's SST values. This warrants attention as a model that includes all three SST will contain a large amount of redundant information. Information contained in the February SST values is about the same as the information contained in the January SST values and the information contained in the March SST values is about the same as the information contained in the February SST values.

A similar set of plots is available using the ggpairs() function from the GGally package.

require(GGally)

## Warning: there is no package called 'GGally'

ggpairs(SST[SST$Year > 1855, ], columns = c("Year", "Jan", "Feb", "Mar", "Aug"))

## Error: could not find function "ggpairs"

To fit a multiple regression model to these data with August SST as the response variable, type

m1 = lm(Aug ~ Year + Jan + Feb + Mar, data = SST)

Then to examine the model coefficients, type

summary(m1)

## 
## Call:
## lm(formula = Aug ~ Year + Jan + Feb + Mar, data = SST)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4189 -0.1128 -0.0125  0.1268  0.4744 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.174864   1.315436    4.69  6.0e-06 ***
## Year         0.000703   0.000371    1.90     0.06 .  
## Jan          0.185606   0.166700    1.11     0.27    
## Feb         -0.117448   0.218748   -0.54     0.59    
## Mar          0.764921   0.161058    4.75  4.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.175 on 150 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared: 0.565,   Adjusted R-squared: 0.553 
## F-statistic: 48.6 on 4 and 150 DF,  p-value: <2e-16

As expected March SST is positively related to August SST, and significantly so. Year is also positively related to August SST. The Year term has a $ p $ value on its coefficient that is marginally significant (suggestive, but inconclusive) against the null hypothesis of a zero trend. However, you can see that the coefficient on January SST is positive, but not statistically significant, and the coefficient on February SST is negative.

You can see there is a strong positive relationship between February SST and August SST, so the fact that the relationship is negative in the context of multiple regression indicates a problem. The problem stems from the correlation between explanatory variables (multicollinearity). Correlation values higher than 0.6 can result in an unstable model because the standard errors on the coefficients are not estimated with enough precision.

As long as multicollinearity is not perfect, estimation of the regression coefficients is possible but estimates and standard errors become sensitive to even the slightest change in the data. Prior understanding of the partial correlation (here the correlation between February SST and August SST controlling for March SST) may help argue in favor of retaining two highly-correlated explanatory variables, but in the usual case it is better to eliminate the variable whose relationship with the response variable is harder to explain physically or that has the smaller correlation with the response variable.

Here February SST has a smaller correlation with August SST, so you remove it and reestimate the coefficients. You create a new linear model object and summarize it by typing

m2 = lm(Aug ~ Year + Jan + Mar, data = SST)
summary(m2)

## 
## Call:
## lm(formula = Aug ~ Year + Jan + Mar, data = SST)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4318 -0.1136 -0.0003  0.1214  0.4781 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.203182   1.311277    4.73  5.1e-06 ***
## Year        0.000681   0.000367    1.85    0.066 .  
## Jan         0.128790   0.128501    1.00    0.318    
## Mar         0.706370   0.118242    5.97  1.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.175 on 151 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared: 0.564,   Adjusted R-squared: 0.555 
## F-statistic: 65.1 on 3 and 151 DF,  p-value: <2e-16

The remaining explanatory variables all have a positive relationship with the response variable, consistent with their bivariate plots, but the coefficient on January SST is not statistically significant.

Thus it is necessary for you to try a third model with this term removed.

m3 = lm(Aug ~ Year + Mar, data = SST)
summary(m3)

## 
## Call:
## lm(formula = Aug ~ Year + Mar, data = SST)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4289 -0.1103 -0.0045  0.1162  0.4740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.769020   1.183533    5.72  5.5e-08 ***
## Year        0.000758   0.000359    2.11    0.036 *  
## Mar         0.799866   0.072656   11.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.175 on 152 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared: 0.561,   Adjusted R-squared: 0.555 
## F-statistic: 97.1 on 2 and 152 DF,  p-value: <2e-16

The model makes sense. August SST values are higher when March SST values are higher and vice versa. This relationship holds after accounting for the upward trend in August SST values.

Note that the order of the explanatory variables on the right side of the ~ does not matter. That is, you get the same output by typing

summary(lm(Aug ~ Mar + Year, data = SST))

## 
## Call:
## lm(formula = Aug ~ Mar + Year, data = SST)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4289 -0.1103 -0.0045  0.1162  0.4740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.769020   1.183533    5.72  5.5e-08 ***
## Mar         0.799866   0.072656   11.01  < 2e-16 ***
## Year        0.000758   0.000359    2.11    0.036 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.175 on 152 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared: 0.561,   Adjusted R-squared: 0.555 
## F-statistic: 97.1 on 2 and 152 DF,  p-value: <2e-16

The multiple R-squared value is slightly lower in the final model with fewer variables. This is always the case. R squared cannot be used to make meaningful comparison of models with different numbers of explanatory variables. The adjusted R squared can be used to make comparisons as it increases when a term is added to a model only if the term is statistically significant.

The final model is checked for adequacy by examining the distribution of model residuals. The five number summary of the residuals given as part of the summary output gives you no reason to suspect the assumption of normally distributed residuals. However, the residuals are likely to have some autocorrelation violating the assumption of independence.

Choosing Predictors

Suppose $ H $ is your variable of interest, and $ X_1, \ldots, X_p $ a set of potential explanatory variables or predictors, are vectors of $ n $ observations. The problem of predictor selection arises when you want to model the relationship between $ H $ and a subset of $ X_1, \ldots, X_p $, but there is uncertainty about which subset to use. The situation is particularly interesting when $ p $ is large and $ X_1, \ldots, X_p $ contains redundant and irrelevant variables.

You want a model that fits the data well and which produces forecasts with small variability. The problem is these two goals are in conflict. An additional predictor in a model will improve the fit (reduce the bias), but will increase the variance due to a loss in the number of degrees of freedom. This is an example of the bias-variance trade-off.

The Akaike Information Criterion (AIC) is a statistic that helps you decide. \[ \mathrm{AIC} = 2(p+1) +n[\log(\mathrm{SSE}/n)], \] where $ p $ is the number of predictors and SSE is the residual sum of squares. You can compare the AIC values when each predictor is added or removed from a given model. For example, if after adding a predictor, the AIC value for the model increases then the trade-off is in favor of the extra degree of freedom and against retaining the predictor.

Returning to your original model for August SST, the model is saved in the object m1. The drop1() function takes your regression model object and returns a table showing the effects of dropping, in turn, each variable from the model. To see this, type

drop1(m1)

## Single term deletions
## 
## Model:
## Aug ~ Year + Jan + Feb + Mar
##        Df Sum of Sq  RSS  AIC
## <none>              4.61 -535
## Year    1     0.111 4.72 -533
## Jan     1     0.038 4.65 -536
## Feb     1     0.009 4.62 -537
## Mar     1     0.693 5.30 -515

Here the full model (all four covariates) has a residual sum of squares (RSS) of 4.61 (in the row; none dropped). If you drop the Year variable, the RSS increases to 4.72 and you gain one degree of freedom. That is too much increase in RSS for the gain of only a single degree of freedom, thus the AIC increases to 4.72. You conclude Year is too important to drop from the model. This is true of March SST, but not of January or February SST.

Therefore to help you choose variables you compare the AIC values for each variable against the AIC value for the full model. If the AIC value is less than the AIC for the full model then the trade-off favors removing the variable from the model. If you repeat the procedure after removing the January and February SST variables you will conclude that there is no statistical reason to make the model simpler.

Stepwise regression automates drop1(). It also can be done using forward selection. The AIC is used as a criterion for choosing or deleting a variable and for stopping. To see how this works with your model, type

step(m1)

## Start:  AIC=-534.9
## Aug ~ Year + Jan + Feb + Mar
## 
##        Df Sum of Sq  RSS  AIC
## - Feb   1     0.009 4.62 -537
## - Jan   1     0.038 4.65 -536
## <none>              4.61 -535
## - Year  1     0.111 4.72 -533
## - Mar   1     0.693 5.30 -515
## 
## Step:  AIC=-536.6
## Aug ~ Year + Jan + Mar
## 
##        Df Sum of Sq  RSS  AIC
## - Jan   1     0.031 4.65 -538
## <none>              4.62 -537
## - Year  1     0.105 4.72 -535
## - Mar   1     1.092 5.71 -506
## 
## Step:  AIC=-537.5
## Aug ~ Year + Mar
## 
##        Df Sum of Sq  RSS  AIC
## <none>              4.65 -538
## - Year  1      0.14 4.79 -535
## - Mar   1      3.71 8.36 -449

## 
## Call:
## lm(formula = Aug ~ Year + Mar, data = SST)
## 
## Coefficients:
## (Intercept)         Year          Mar  
##    6.769020     0.000758     0.799866

The output is a series of tables showing the RSS and AIC values with successive variables removed. The default method is backward deletion, which amounts to a successive application of the drop1() function. It's a good strategy to try the other selection methods to see if the results are the same. They may not be. If they are you will have greater confidence in your final model.

Cross-Validation

Cross-validation is needed if your statistical model will be used to make actual forecasts. Cross-validation is a procedure to assess how well your scheme will do in forecasting the unknown future. In the case of independent hurricane seasons, it involves withholding a season's worth of data, developing the algorithm on the remaining data, then using the algorithm to predict data from the season that was withheld. Note that if your algorithm involves stepwise regression or machine learning, then the predictor selection component must be part of the cross validation. That is, after removing a season's worth of data, you must run your selection procedure and then make a single-season prediction using the final model(s). And this needs to be done for each season removed.

The result of cross-validation is an estimate of out-of-sample error that accurately estimates the average forecast error when the model is used in predicting the future. The out-of-sample prediction error is then compared with the prediction error computed out of sample using a simpler model. If the error is larger with the simpler model, then your model is considered skillful. Note that the prediction error from the simpler model, even if it is long-run climatology, also needs to be obtained using cross-validation.

In this chapter we reviewed classical statistics with examples from hurricane climatology. Topics included descriptive statistics, probability and distributions, one- and two-sample tests, statistical formula in R, correlation, and regression. Next we give an introduction to Bayesian statistics.