Chapter 3: Classical Statistics
===============================

"The difference between 'significant' and 'not significant' is not itself statistically significant."---Andrew Gelman

```{r datePackages, message=FALSE}
#setwd("~/Dropbox/Book/Chap03")
date()
require(MASS)
require(ellipse)
require(boot)
```

Summary Statistics, Distributions, Significance Tests, & Correlation
--------------------------------------------------------------------

### Mean, Median, and Maximum

The data set H.txt contains annual counts of hurricanes making landfall in the United States (excluding Hawaii). To input the data and save them as a data object type

```{r inputUShurricaneData}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/H.txt"
H = read.table(con, header=TRUE)
```

You can obtain the mean and variance by typing

```{r meanVariance}
mean(H$All); var(H$All)
```

The semicolon acts as a return so you can place multiple commands on the same text line. The values printed on your screen have too many digits. Since there are only `r I(length(H$All))` years, the number of significant digits is 2 or 3. This can be changed using the signif() function.

```{r significantFigures}
signif(mean(H$All), digits=3)
```

Notice the nesting. Here the mean() function is nested inside the signif() function. You can set the digits globally using the options() function.

The median, standard deviation, maximum, and minimum are obtained by typing

```{r statistics}
median(H$All); sd(H$All); max(H$All); min(H$All)
```

Higher order moments like skewness and kurtosis are available in the **moments** and **psych** packages.

```{r getPsychPackage, message=FALSE}
require(psych)
skew(H$All)
```

A symmetric distribution has a skewness of zero. To determine which year has the maximum count, you first test each year's count against the maximum using the logical operator ==. This provides a list of years for which the operation returns a TRUE. You then subset the hurricane year according to this logical list.

```{r whichMost}
maxyr = H$All == max(H$All)
maxyr
H$Year[maxyr]
```

To find the years with no hurricanes, type

```{r noHurricanes}
H$Year[H$All == min(H$All)]
```

How many years have no hurricanes?

```{r howManyWithNohurricanes}
sum(H$All == 0)
```

The sum() function counts a TRUE as 1 and a FALSE as 0 so the result tells you how many years have a count of zero hurricanes.

What is the longest streak of years without a hurricane? To answer this, first you create an ordered vector of years with at least one hurricane. Next you use the diff() function to take differences between sequential years given in the ordered vector. Finally, you find the maximum of these differences minus one.

```{r noStrikeStreak}
st = H$Year[H$All > 0]
max(diff(st) - 1)
```

Thus the longest streak without hurricanes is only `r I(max(diff(st)-1))` years. Or use the rle() function (run length encoding) to compute the length and values of runs in a vector and then table the results.

```{r runLengths}
st = H$All == 0
table(rle(st))
```

The results show `r I(table(rle(st))[2,2])` sets of two consecutive years without a hurricane and `r I(table(rle(st))[3,2])` set of three consecutive years without a hurricane.

### Quantiles

Import the NAO data file.
```{r readNAOdata}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/NAO.txt"
NAO = read.table(con, header=TRUE)
```

Obtain quantiles of the June NAO values by typing

```{r NAOquantiles}
nao = NAO$Jun
quantile(nao)
```

By default you get the minimum, the maximum, and the three quartiles (the 0.25, 0.5, and 0.75 quantiles). To obtain other quantiles you include the prob argument in the function. To find the 19th, 58th, and 92nd percentiles of the June NAO values, type

```{r otherQuantiles}
quantile(nao, prob=c(.19, .58, .92))
```

Be aware that there are different ways to compute quantiles. Details can be found in the documentation by typing

```{r quantileHelp, eval=FALSE}
help(quantile)
```

### Missing Values

The way R handles missing values depends on the context. With a vector of values, some of which are missing and marked with NA, the summary() function computes statistics and returns the number of missing values.

```{r summarizeAugSST}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/SST.txt"
SST = read.table(con, header=TRUE)
sst = SST$Aug
summary(sst)
```

There are five years with missing values during August. The summary statistics are computed by removing the missing values. A function like mean() applied to a vector with missing values returns an NA.

```{r meanSST}
mean(sst)
```

In this case R does not remove the missing values unless requested to do so. The mean of a vector with an unknown value is unknown. To remove the missing values, type

```{r meanSSTnarm}
mean(sst, na.rm=TRUE)
```

One exception is the length() function; it has no na.rm argument. To count the missing values use the is.na() function, which returns a TRUE for a missing value and FALSE otherwise. Then use sum() to count the number of TRUEs. For your August SST data type

```{r lengthWithNAs}
sum(is.na(sst))
```

The number of non-missing data values is obtained by using the logical negation operator ! (read as 'not'). For example, type

```{r nonMissing}
sum(!is.na(sst))
```

This tells you there are `r I(sum(!is.na(sst)))` years with August SST values.

### Random Samples

Probability theory arose from studying games of chance like rolling dice and picking cards at random. Randomness and probability are central to statistics. You can simulate games of chance with the sample() function. To pick four years at random from the 1990s you type

```{r sampleFunction}
sample(1990:1999, size=4)
```

The sequence function : is used to create a vector of ten values representing the years in a decade. The sample() function is then used to pick, at random without replacement, a set of four (size=4) values. This is called a random sample. In deciphering R code it is helpful to read from right to left and from inside to outside. That is, start by noting a size of four from a sequence of numbers from 1990 through 1999 and then taking a sample from these numbers.

The default is `sample without replacement.' The sample will not contain a repeat year. If you want to sample with replacement, use the argument replace=TRUE. Sampling with replacement is suitable for modeling the occurrence of El Nino (E) and La Nina (L) events. To model the occurrence over ten random seasons type

```{r sampleElNino}
sample(c("E", "L"), size=10, replace=TRUE)
```

The idea of random events is not restricted to equal probabilities. For instance, suppose you are interested in the occurrence of hurricanes over Florida. Let the probability be 12% that a hurricane picked at random hits Florida.
You simulate 24 random hurricanes by typing

```{r sampleFloridaHits}
sample(c("hit", "miss"), size=24, replace=TRUE, prob=c(.12, .88))
```

The simulated frequency of hits will not be exactly 12%, but the variation about this percentage decreases as the sample size increases according to the law of large numbers.

### Combinatorics

What is the probability of randomly drawing a particular set of three years from the decade? This is worked out as follows. The probability of getting a particular year (say 1992) as the first one in the sample is 1/10, the next one is 1/9, and the next one is 1/8. Thus the probability of a given ordered sample is 1/(10 x 9 x 8). The prod() function calculates the product of a set of numbers so you get the probability by typing

```{r productFunction}
1/prod(10:8)
```

If you are not interested in the arrangement of years then use the factorial() function to count the orderings of the same three years and multiply this number by the probability to get

```{r factorial}
factorial(3)/prod(10:8)
```

You get the same result using the choose() function. You can check the equality by typing

```{r chooseFunction}
factorial(3)/prod(10:8) == 1/choose(10, 3)
```

The double equal signs indicate a logical operator with two possible outcomes (TRUE and FALSE), so a return of TRUE indicates equality.

### Normal Distribution

Four items of interest can be calculated for a statistical distribution.

* Density or point probability
* Cumulative distribution function
* Quantiles
* Random numbers

For each distribution in R there is a function corresponding to each of the four items. The function has a prefix letter indicating which item. The prefix letter d is used for the probability density function, p is used for the cumulative distribution function, q is used for the quantiles, and r is used for random samples. The root name for the normal distribution is norm, so the probability density function for the normal distribution is dnorm(), the cumulative distribution function is pnorm(), the quantile function is qnorm(), and the random sample function is rnorm().

The normal distribution is a family of distributions, where the family members have different parameter values. The family member with $\mu = 0$ and $\sigma^2=1$ is called the standard normal distribution.

```{r normalDistributionPlot, fig.width=4.6, fig.height=3.4}
x = seq(-5, 5, .01)
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
plot(x, dnorm(x, mean=0, sd=1), ylim=c(0, 1), xlab="x", ylab="f(x)", lwd=2, col=1, type="l")
lines(x, dnorm(x, mean=0, sd=sqrt(.2)), col=2, lwd=2)
lines(x, dnorm(x, mean=0, sd=sqrt(5)), col=3, lwd=2)
lines(x, dnorm(x, mean=-2, sd=sqrt(.5)), col=4, lwd=2)
```

The normal distribution is the foundation for many statistical models. It is analytically tractable and arises as the outcome of the *central limit theorem*, which states that the sum of a large number of independent random variables, almost regardless of their individual distributions, is distributed approximately normally.

### Weibull Distribution

You can use the density to draw a picture of a distribution. First create a sequence of $x$ values, then plot the sequence along with the corresponding densities from a distribution with values for the parameters. For example, the Weibull distribution provides a good description for wind speeds.
To draw its density type

```{r weibullDistributionPlot, fig.width=4, fig.height=3}
par(las=1, mgp=c(2.7, .4, 0), tcl=-.3)
w = seq(0, 80, .1)
plot(w, dweibull(w, shape=2.5, scale=40), type="l", ylim=c(0, .03), ylab="Probability density", lwd=2)
```

The seq() function generates equidistant values in the range from 0 to 80 in steps of 0.1. The distribution has two parameters, the shape and scale. Along with the vector of $w$ values you must specify the shape parameter. The default for the scale parameter is 1. You create a similar plot using the curve() function by typing

```{r weibullCurve, eval=FALSE}
curve(dweibull(x, shape=2.5, scale=40), from=0, to=80)
```

The first argument [here dweibull()] must be a function or expression that contains $x$.

### Binomial Distribution

Given a random set of 24 hurricanes and a probability $p = 12$% that a hurricane picked at random hits Florida (hit rate), the probability that exactly three of them strike Florida [$P(H=3)$] is found by typing

```{r landfallProbs}
choose(24, 3) * .12^3 * (1 - .12)^(24 - 3)
```

There is a probability associated with every non-negative integer from zero to 24. You plot the probabilities by typing

```{r landfallProbPlot, fig.width=4, fig.height=3}
par(las=1, mgp=c(2.3, .4, 0), tcl=-.3)
x = 0:15
plot(x, dbinom(x, 24, .12), ylim=c(0, .3), xlab="Number of Florida hurricanes", ylab="Probability", type="h", lwd=5)
```

The distribution is discrete with probabilities assigned only to counts. The distribution peaks at two and three hurricanes and there are small but non-zero probabilities for the largest counts.

### Poisson Distribution

The Poisson distribution is a good description of hurricane counts. To draw a Poisson distribution, type

```{r poissonDistributionPlot, fig.width=4, fig.height=3}
par(las=1, mgp=c(2.3, .4, 0), tcl=-.3)
h = 0:16
plot(h, dpois(h, lambda=5.6), type="h", ylab="Probability", xlab="Number of hurricanes", lwd=5)
```

The distribution corresponds to the probability of observing h number of hurricanes given a value for the rate parameter. The argument type="h" causes pins to be drawn.

### Cumulative Distributions

The cumulative distribution function describes the probability of obtaining a value less than or equal to a given value $x$. The function pnorm() returns the probability of getting a value equal to or smaller than its first argument in a normal distribution with a given mean and standard deviation. Consider the NAO data for June. Assuming these values are described by a normal distribution the chance that a June value is less than -1.5 is obtained by typing

```{r pnormFunctionJuneNAO}
pnorm(-1.5, mean=mean(nao), sd=sd(nao))
```

Thus about `r I(round(pnorm(-1.5, mean=mean(nao), sd=sd(nao)), 2)*100)`% of the June NAO values are less than or equal to -1.5.

The annual rate of East coast hurricanes is obtained by typing mean(H$E). This is the rate (lambda) parameter value for the Poisson distribution. So the probability of next year having no East coast hurricane is obtained by typing

```{r probNoEastcoastH}
ppois(0, lambda=mean(H$E))
```

### Quantile Functions

The quantile function is the inverse of the cumulative distribution function. The $p$-quantile is the value such that there is a $p$ probability of getting a value less than or equal to it. The median value is, by definition, the .5 quantile. In tests of statistical significance, the $p$-quantile is usually set at $p$ = 0.05 and is called the $\alpha = p \times 100$% significance level.
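As a quick illustration of the quantile function (using the standard normal purely as a reference distribution), the value below which 5% of the probability lies, and the pair of values bounding the central 95%, are returned by qnorm().

```{r qnormExample}
qnorm(.05)            # the 0.05 quantile of the standard normal
qnorm(c(.025, .975))  # quantiles bounding the central 95%
```

These two tail quantiles reappear in the confidence interval calculation below.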
You are interested in knowing the threshold that a test statistic must cross in order to be considered significant at that level. The $p$-value is the probability of obtaining a value as large as, or larger than, the $p$-quantile.

Theoretical quantiles are also used to calculate confidence intervals. If you have $n$ normally distributed observations from a population with mean $\mu$ and standard deviation $\sigma$, then the average $\bar x$ is normally distributed around $\mu$ with standard deviation $\sigma/\sqrt{n}$. A 95% confidence interval for $\mu$ is obtained as
$$
\bar x + \sigma/\sqrt{n} \times N_{.025} \le \mu \le \bar x + \sigma/\sqrt{n} \times N_{.975}
$$
where $N_{0.025}$ and $N_{0.975}$ are the 2.5 and 97.5 percentiles of the standard normal distribution, respectively. You compute a 95% confidence interval about the population mean using the sample of June NAO values by typing

```{r confidenceIntervalCalculations}
xbar = mean(nao)
sigma = sd(nao)
n = length(nao)
sem = sigma/sqrt(n)
xbar + sem * qnorm(.025)
xbar + sem * qnorm(.975)
```

This produces a 95% confidence interval for the population mean of [`r I(round(xbar+sem*qnorm(.025),2))`, `r I(round(xbar+sem*qnorm(.975),2))`]. The left tail value is qnorm(.025) and the right tail value is qnorm(.975). Also, the quantile for the standard normal is written as $\Phi^{-1}(.975)$, where $\Phi$ is the notation for the cumulative distribution function of the standard normal. Thus the confidence interval for the population mean is
$$
\bar x \pm \sigma/\sqrt{n} \times \Phi^{-1}(.975)
$$

An application of the quantile function is to assess the assumption of normality. This is done by matching empirical quantiles with the quantiles from the standard normal distribution.

### Random Numbers

The distribution functions in R can be used to generate random numbers (deviates). The first argument specifies the number of random numbers to generate, and the subsequent arguments are the parameters of the distribution. To generate 10 random numbers from a standard normal distribution, type

```{r randomNormal}
rnorm(10)
```

If the seed value is the same the set of random numbers will be the same.

```{r randomNormalseed}
set.seed(3042)
rnorm(10)
```

Here the set.seed() function uses the default Mersenne Twister RNG with a seed value of 3042. The commands must be typed in the same order; first set the seed, then generate the random numbers. Specifying a RNG and a seed value allows results to be replicated exactly. This is important when your results depend on a method that exploits random values as is the case with some of the Bayesian models.

To simulate the next 20 years of Florida hurricane counts based on the counts over the historical record, you type

```{r randomPoisson}
rpois(20, lambda=mean(H$FL))
```

To simulate the maximum wind speed (m s$^{-1}$) from the next 10 tropical cyclones occurring over the North Atlantic, you type

```{r randomWeibul}
rweibull(10, shape=2.5, scale=50)
```

### One-Sample Test

The one-sample $t$ (student $t$) test is based on the assumption that your data values ($x_1, \ldots, x_n$) are independent and come from a normal distribution with a mean $\mu$ and variance $\sigma^2$. The shorthand notation is
$$
x_i \sim \hbox{iid } N(\mu, \sigma^2)
$$
where the abbreviation iid indicates `independent and identically distributed.' You wish to test the *null hypothesis* that $\mu = \mu_0$. You estimate the parameters of the normal distribution from your sample.
The average $\bar x$ is an estimate of $\mu$ and the sample variance $s^2$ is an estimate of $\sigma^2$. It's important to keep in mind that you can never know the true parameter values. In statistical parlance, they are said to be constant, but unknowable.

The key concept is that of the *standard error*. The standard error of the mean (or s.e.($\bar x$)) describes the variation in your average calculated from your $n$ values. That is, suppose you had access to another set of $n$ values (from the same set of observations or from the same experiment) and you again compute $\bar x$ from these values. This average will almost surely be different from the average calculated from the first set. The same statistic from different samples taken from the same population will vary.

You can demonstrate this for yourself. First generate a population from a normal distribution with a fixed mean (3) and standard deviation (4).

```{r saveRandomNormals}
X = rnorm(2000, mean=3, sd=4)
```

Next take five samples each with a sample size of six and compute the average.

```{r meanRandomNormals}
for(i in 1:5) print(mean(sample(X, size=6, replace=TRUE)))
```

The sample means are not all the same and none is exactly three. What happens to the variability in the list of means when you increase your sample size from six to 60? What happens when you increase the population standard deviation from four to 14? Try it.

The standard error of the mean is
$$
\hbox{s.e.}(\bar x) = \frac{\sigma}{\sqrt{n}}
$$
where $\sigma$ is the population standard deviation and $n$ is the sample size. Even with only a single sample we can estimate s.e.($\bar x$) by substituting the sample standard deviation $s$ for $\sigma$. The s.e.($\bar x$) tells you how far the sample average may reasonably be from the population mean. With data that are normally distributed there is a 95% probability of the sample average staying within $\mu \pm 1.96\,\sigma/\sqrt{n}$. It implies that if you take many samples from the population, computing the average for each sample, you will find that 95% of the samples have an average that falls within about two s.e.($\bar x$)s of the population mean. The value of 1.96 comes from the fact that the area under the standard normal density between $-$1.96 and $+$1.96 is .95.

With a bit more code you can verify this for your sample of data saved in the object X.

```{r testConfidenceInterval}
set.seed(3042)
X = rnorm(2000, mean=3, sd=4)
Xb = numeric()
for(i in 1:1000) Xb[i] = mean(sample(X, size=6))
p = sum(Xb > 3 - 2 * 4/sqrt(6) & Xb < 3 + 2 * 4/sqrt(6))/1000
p
```

That is, for a given sample, there is a `r I(round(p*100,0))`% chance that the interval defined by the s.e.($\bar x$) will cover the true (population) mean.

In the case of a one-sample test, you postulate a population mean and you have a single sample of data. For example, let $\mu_0$ be a guess at the true mean, then you calculate the $t$ statistic as
$$
t = \frac{\bar x - \mu_0}{\hbox{s.e.}(\bar x)}
$$
With $(\bar x - \mu_0)$ = 2 $\times$ s.e.($\bar x$), your $t$ statistic is two. Your sample mean could be larger or smaller than $\mu_0$ so $t$ can be between $-$2 and $+$2 with $\bar x$ within 2 s.e.($\bar x$)s of $\mu_0$.

If you have few data (less than about 30 cases), you need to correct for the fact that your estimate of s.e.($\bar x$) uses the sample standard deviation rather than $\sigma$. By using $s$ instead of $\sigma$ your chance of being farther from the population mean is larger.
The correction is made by substituting the $t$-distribution (Student's $t$-distribution) for the standard normal distribution. Like the standard normal the $t$-distribution is continuous, symmetric about the origin, and bell-shaped. It has one parameter called the degrees of freedom ($\nu$) that controls the relative `heaviness' of the tails. The degrees of freedom (df) parameter $\nu = n-1$ where $n$ is the sample size. For small samples, the tails of the $t$-distribution are heavier than the tails of a standard normal distribution, meaning that it is more likely to produce values that fall far from the mean.

```{r normalandstudentt, fig.width=4.6, fig.height=3.4}
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
leg = c("N(0,1)", "t(9)", "t(1)")
curve(dnorm(x), from=-3, to=3, lwd=2, ylab="Probability density")
curve(dt(x, 10 - 1), from=-3, to=3, add=TRUE, col=2, lwd=2)
curve(dt(x, 2 - 1), from=-3, to=3, add=TRUE, col=2, lwd=1)
legend("topright", legend=leg, col=c(1, 2, 2), lwd=c(2, 2, 1))
```

For instance the difference in cumulative probabilities at the $\pm$1.96 quantile values from a $t$-distribution with 9 df is given by

```{r differenceCumulativeProbabilities}
pt(q=1.96, df=9) - pt(q=-1.96, df=9)
```

This probability is smaller than for the standard normal distribution. This indicates that only about 92% of the time the interval between $\pm$1.96 will cover the true mean. Note in the figure how the $t$-distribution approximates the normal distribution as the sample size increases. For sample sizes of 30 or more there is essentially no difference.

The one-sample $t$ test proceeds in three steps. First, given a hypothesized population mean ($\mu_0$), the $t$ statistic is computed from your sample of data. Next the cumulative probability of that value is determined from the $t$-distribution with df equal to sample size minus one. Finally, the probability is multiplied by two to obtain the $p$-value. A small $p$-value leads to a rejection of the null hypothesis and a large $p$-value leads to a failure to reject the null hypothesis.

The $p$-value is an estimate of the probability that a particular result, or a result more extreme than the result observed, could have occurred by chance if the null hypothesis is true. In the present case, if the true mean is $\mu_0$ what is the probability that your sample mean is as far or farther from $\mu_0$ as it is? In short, the $p$-value is a measure of the credibility of the null hypothesis. The higher the $p$-value the more credible the null hypothesis appears given your sample of data. But, the $p$-value is best interpreted as evidence *against* the null hypothesis, thus a small value indicates evidence to reject the null. The interpretation is not black and white. A convenient way to express the evidence is given in this table.

The $p$-value as evidence against the null hypothesis.

$p$-value  | evidence
---------- | ----------
0--0.01    | convincing
0.01--0.05 | moderate
0.05--0.15 | suggestive, but inconclusive
$>$ 0.15   | no

Consider the area-averaged North Atlantic sea-surface temperature (SST) values each August (C) as an example. Input the monthly data and save the values for August in a separate vector by typing

```{r getSST}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/SST.txt"
SST = read.table(con, header=TRUE)
sst = SST$Aug
```

Begin with a summary table.

```{r summaryTableSST}
summary(sst)
```

You might be interested in the hypothesis that the values deviate significantly from 23.2C. The task for you is to test whether this distribution has a mean $\mu=23.2$.
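Before calling the built-in test it can help to carry out these steps by hand. The following sketch (the object names are arbitrary) computes the $t$ statistic and the two-sided $p$-value for the August SST values against $\mu_0 = 23.2$, removing the missing values first; the numbers should match the t.test() output below.

```{r tTestByHand}
xbar = mean(sst, na.rm=TRUE)       # sample mean
s = sd(sst, na.rm=TRUE)            # sample standard deviation
n = sum(!is.na(sst))               # number of non-missing years
tstat = (xbar - 23.2)/(s/sqrt(n))  # t statistic
tstat
2 * pt(-abs(tstat), df=n - 1)      # two-sided p-value
```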
Assuming the normal distribution is an adequate model for your data, this is done using the t.test() function as follows.

```{r tTestSST}
t.test(sst, mu=23.2)
```

The output begins with a description of the test followed by the name of the data object used. The next line contains the value of the $t$-statistic, degrees of freedom, and the $p$-value. The df (degrees of freedom) is the number of values in calculating the $t$-statistic that are free to vary. Here it is equal to the number of years (number of independent pieces of information) that go into calculating the $t$-statistic minus the number of statistics used in the intermediate steps of the calculation. The $p$-value is `r I(round(t.test(sst, mu=23.2)$p.value,3))` indicating convincing evidence against the null hypothesis that the mean is 23.2. Said another way, if the true mean is 23.2, then the sample of data we have is very unusual.

A sentence regarding the alternative hypothesis is printed next. It has two pieces of information: the value corresponding to your hypothesis and whether your test is one- or two-sided. Here it states 'not equal to', indicating a two-sided test. You specify a one-sided test against the alternative of a larger $\mu$ by using the alternative="greater" argument. For instance you might hypothesize the temperature to exceed a certain threshold value. Abbreviated argument names often work. For example, here it is okay to write alt="g" to get the one-sided, greater than, alternative.

The next output is the 95% confidence interval for the true mean. You can think of it as defining a set of hypothetical mean values, such that if they were used as values for your null hypothesis (instead of 23.2) they would lead to a $p$-value of 0.05 or greater (failure to reject the null). You can specify a different confidence level with the conf.level argument. For example, conf.level=.99 will give you a 99% interval. The final bit of output is the mean value from your sample. It is the best single estimate for the true mean.

Note that you are not testing the data. You are testing a hypothesis about some hypothetical value using your data. In classical inference you state a hypothesis that, for example, the population mean has a value equal to $\mu_0$. You then use your data to see if there is evidence to reject it. The evidence is summarized as a $p$-value. The hypothesis is a model (a normal distribution centered on $\mu_0$). A $p$-value less than 0.15 is taken as suggestive, but inconclusive evidence that the hypothesis is wrong, while a $p$-value less than 0.01 is convincing evidence it is wrong.

### Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a non-parametric alternative to the one-sample $t$ test (its two-sample counterpart, described below, is equivalent to the Mann-Whitney U test). The hypothesized mean ($\mu_0$) is subtracted from each observation and the absolute values of the differences are ranked, with the smallest difference given a rank of one.

The function rank() is used to obtain ranks from a set of values. For example, type

```{r rankFunction}
rank(c(2.1, 5.3, 1.7, 1.9))
```

The function returns the ranks with each value assigned a ranking from lowest to highest. Here the value of 2.1 in the first position of the data vector is ranked third and the value of 1.7 in the third position is ranked one.
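When two or more values are tied, rank() assigns each of them the average of the ranks they would otherwise occupy. A minimal illustration with made-up numbers:

```{r rankTies}
rank(c(2.1, 5.3, 1.7, 1.7))
```

The two tied values share ranks 1 and 2, so each is assigned 1.5.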
The ranks of the first 18 differences between the observed values and the null hypothesis value are

```{r rankedDifferences}
x = sst - 23.2
r = rank(abs(x))
r[1:18]
```

This says there are `r I(r[1])` years with an absolute difference less than or equal to that of the first year. By default ties are handled by averaging the ranks, so for an even number of tied values the ranks are expressed as a fractional half, otherwise they are a whole number.

The test statistic V is the sum of the ranks corresponding to the values that are above the hypothesized mean

```{r sumRanks}
sum(r[x > 0], na.rm=TRUE)
```

Assuming only that the distribution is symmetric around $\mu_0$, the test statistic corresponds to selecting each rank from 1 to $n$ with probability of .5 and calculating the sum. The distribution of the test statistic can be calculated exactly. For large samples the distribution is approximately normal.

The application of the Wilcoxon test is done in the same way as the application of the $t$ test. You specify the data values in the first argument and the hypothesized population mean in the second argument.

```{r wilcoxTest}
wilcox.test(sst, mu=23.2)
```

The $p$-value indicates convincing evidence against the null hypothesis. If the assumptions are met, then the $t$ test will be more efficient by about 5% relative to the Wilcoxon test. That is, for a given sample size, the $t$ test better maximizes the probability that the test will reject the null hypothesis when it is false. That is, the $t$ test has more power than the Wilcoxon test.

The Wilcoxon test has problems when there are ties in the ranks for small samples. By default (if exact is not specified), an exact $p$-value is computed if the samples contain fewer than 50 values and there are no ties. Otherwise a normal approximation is used.

### Two-Sample Test

The two-sample $t$ test is used to test the hypothesis that two samples come from distributions with the same population mean. The theory is the same as that employed in the one-sample test. Vectors are now doubly indexed ($x_{1,1}, \ldots, x_{1,n_1}$ and $x_{2,1}, \ldots, x_{2,n_2}$). The first index identifies the sample and the second identifies the case. The assumption is that normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ adequately describe your values and your interest is to test the null hypothesis $\mu_1 = \mu_2$. You calculate the $t$-statistic as
$$
t = \frac{\bar x_1 - \bar x_2}{\sqrt{\hbox{s.e.}(\bar x_1)^2+\hbox{s.e.}(\bar x_2)^2}}
$$
where the denominator is the standard error of the difference in means and $\hbox{s.e.}(\bar x_i) = s_i/\sqrt{n_i}$.

If you assume the two samples come from populations with the same variance ($\sigma_1^2 = \sigma_2^2$) then you calculate the s.e.($\bar x$)s using a single value for $s$ based on the standard deviation of the values over both samples. Under the null hypothesis that the population means are the same, the $t$-statistic will follow a $t$-distribution with $n_1+n_2-2$ degrees of freedom.

If you don't assume equal variance the $t$-statistic is approximated by a $t$-distribution after adjusting the degrees of freedom by the Welch procedure. By default the t.test() function uses the Welch procedure resulting in a non-integer degrees of freedom. Regardless of the adjustment, the two versions of the test will usually give about the same result unless the sample sizes and the standard deviations are quite different.
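To see the pieces of this statistic in code, here is a sketch with two made-up samples (the sample sizes, means, and standard deviations are arbitrary); the hand-computed value matches the Welch statistic returned by t.test().

```{r twoSampleByHand}
set.seed(3042)
x1 = rnorm(15, mean=0, sd=1)   # first made-up sample
x2 = rnorm(25, mean=.5, sd=2)  # second made-up sample
se = sqrt(var(x1)/length(x1) + var(x2)/length(x2))  # s.e. of the difference in means
(mean(x1) - mean(x2))/se       # the t statistic by hand
t.test(x1, x2)$statistic       # the same value from t.test()
```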
As an example, suppose you are interested in whether the June NAO values have mean values that are different depending on hurricane activity along the Gulf coast later in the year.

```{r readData2}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/H.txt"
H = read.table(con, header=TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-3/NAO.txt"
NAO = read.table(con, header=TRUE)
nao = NAO$Jun
```

First create two samples of NAO values. The first sample contains June NAO values in years with no Gulf hurricanes and the second sample contains NAO values in years with at least two Gulf hurricanes. This is done using the subset() function.

```{r NAOvectors}
nao.s1 = subset(nao, H$G == 0)
nao.s2 = subset(nao, H$G >= 2)
```

Summarize the values with the mean, standard deviation, and sample size.

```{r summarizeNAOvalues}
mean(nao.s1); sd(nao.s1); length(nao.s1)
mean(nao.s2); sd(nao.s2); length(nao.s2)
```

The mean NAO value is larger during inactive hurricane years, but is it significantly larger? The standard deviation is also larger and so is the sample size.

Your null hypothesis is that the population mean of June NAO values during active Gulf years is equal to the population mean of June NAO values during inactive years. The test is performed with the t.test() function with the data vectors as the two arguments. By default it uses the Welch procedure.

```{r twosampleTtest}
t.test(nao.s1, nao.s2)
```

The output is similar to before. Assuming the null hypothesis of no difference in population means is correct, this $t$ value (or a value larger in magnitude) has a small probability of occurring by chance, providing compelling evidence that June NAO values are different between the two samples.

As with the one-sample test the alternative hypothesis, which is that the true difference in means is not equal to zero, is stated as part of the output. This is the most common alternative in these situations. The confidence interval (CI) refers to the difference in sample means (mean from sample 1 minus mean from sample 2). If you are willing to assume the variances are equal (for example both samples come from the same population), you can specify the argument var.equal=TRUE. In this case the number of degrees of freedom is a whole number, the $p$-value is larger, and the confidence interval wider.

### Statistical Formula

While you might consider the data in separate vectors, this is not the best way to do things in R. Instead of creating subsets of the object nao based on values in the object H, you create a data frame with two parallel columns. Include all values for the NAO in one column and the result of a logical operation on Gulf hurricane activity in a separate column.

```{r Gulfhurricanes}
gulf = H$G > 1
nao.df = data.frame(nao, gulf)
tail(nao.df)
```

This shows the NAO values and whether or not there were two or more Gulf hurricanes in the corresponding years. The goal is to see whether there is a shift in the level of the NAO between the two groups of hurricane activity years (TRUE and FALSE). Here the groups are years with two or more Gulf hurricanes (TRUE) and years with one or fewer hurricanes (FALSE). With this setup you specify a two-sample $t$-test using the tilde (~) operator as

```{r tTestNAO}
t.test(nao ~ gulf, data=nao.df)
```

The object to the left of the tilde is the variable you want to test and the object to the right is the variable used for testing. The tilde is read as 'described by' or 'conditioned on.'
That is, the June NAO values are described by Gulf coast hurricane activity. This is how statistical models are specified in R. You will see this model structure throughout the book. Note that by using data=nao.df you can refer to the column vectors in the data frame by name in the model formula.

The conclusion is the same. Years of high and low Gulf hurricane activity appear to be presaged by June NAO values that are significantly different. The output is essentially the same although the group names are taken from the output of the logical operation. Here FALSE refers to inactive hurricane years.

### Two-Sample Wilcoxon Test

As with the one-sample Wilcoxon test, the two-sample counterpart is based on replacing your data values by their corresponding rank. This is done without regard to group. The test statistic $W$ is then computed as the sum of the ranks in one group. The function is applied using the model structure as

```{r twoSampleWilcoxonTest}
wilcox.test(nao ~ gulf, data=nao.df)
```

The results are similar to those found using the $t$-test and are interpreted as convincing evidence of a relationship between late spring NAO index values and hurricane activity along the Gulf coast of the United States.

### Compare Variances

It's not necessary to assume equal variances; indeed, not assuming them is the default in t.test(). Yet your interest could be whether the variability is changing. For instance you might speculate that the variability in some metric of hurricane activity will increase with global warming.

Note that the variance is strictly positive. Given two samples of data, the ratio of variances will be unity if the variances are equal. Under the assumption of equal population variance the $F$-statistic, as the ratio of the sample variances, has an $F$-distribution with two parameters. The parameters (called the degrees of freedom) are the two sample sizes, each minus one. The $F$-distribution is positively skewed. You plot two $F$ distributions having different parameter values by typing

```{r Fdistribution, fig.width=4.4, fig.height=3.3}
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
leg = c("F(80, 120)", "F(12, 8)")
curve(df(x, df1=80, df2=120), from=0, to=3, lwd=2, ylab="Probability density")
curve(df(x, df1=12, df2=8), from=0, to=3, add=TRUE, col=2, lwd=2)
legend("topright", legend=leg, col=c(1, 2), lwd=c(2, 2))
```

Larger sample sizes result in a distribution that is more symmetric and centered on one. The function var.test() is called in the same way as the function t.test(), but performs an $F$ test on the ratio of the sample variances.

```{r varianceTest}
var.test(nao ~ gulf, data=nao.df)
```

The magnitude of the $p$-value provides suggestive but inconclusive evidence of a difference in population variance. Note that the 95% confidence interval on the ratio of variances includes the value of one, as you would expect given the $p$-value. The $F$-test is sensitive to departures from normality. Also, for small data sets the confidence interval will be quite wide, often requiring you to take the assumption of equal variance as a matter of belief.

### Pearson's Product-Moment Correlation

Correlation indicates the amount and the direction of association between two variables. If hurricanes occur more often when the ocean is warmer, then you say that ocean temperature is positively correlated with hurricane incidence; as one goes up, the other goes up.
Pearson's product-moment correlation coefficient is derived from the bivariate normal distribution of two variables, where the theoretical correlation describes contour ellipses about the two-dimensional densities. It is the workhorse of climatological studies. If both variables are scaled to have unit variance, then a correlation of zero corresponds to circular contours and a correlation of one corresponds to a line segment.

The figure below shows two examples: one where the variables $x$ and $y$ have a small positive correlation and the other where they have a large negative correlation. The points are generated from a sample of bivariate normal values with a Pearson product-moment correlation of 0.2 and $-$0.7. The contours enclose the 75 and 95% probability regions for a bivariate normal distribution with means of zero, unit variances, and the corresponding correlations. You create this figure by typing

```{r correlationscatterplot, message=FALSE, fig.width=6, fig.height=4}
require(MASS); require(ellipse)
par(mfrow=c(1, 2), pty = "s", las=1, mgp=c(2, .4, 0), tcl=-.3)
plot(mvrnorm(100, mu=c(0, 0), Sigma=matrix(c(1, .2, .2, 1), 2, 2)), ylim=c(-4, 4), xlim=c(-4, 4), pch=1, xlab="x", ylab="y")
lines(ellipse(.2, level=.75), col="red", lwd=2)
lines(ellipse(.2, level=.95), col="red", lwd=2)
mtext("a", side=3, line=1, adj=0, cex=1.1)
plot(mvrnorm(100, mu=c(0, 0), Sigma=matrix(c(1, -.7, -.7, 1), 2, 2)), ylim=c(-4, 4), xlim=c(-4, 4), pch=1, xlab="x", ylab="y")
lines(ellipse(-.7, level=.75), col="red", lwd=2)
lines(ellipse(-.7, level=.95), col="red", lwd=2)
mtext("b", side=3, line=1, adj=0, cex=1.1)
```

The Pearson correlation coefficient between $x$ and $y$ is
$$
r(x,y) = \frac{\sum (x_i - \bar x) (y_i - \bar y)}{\sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2}}
$$
The Pearson correlation is often called the 'linear' correlation since the absolute value of $r$ will be one when there is a perfect linear relationship between $x_i$ and $y_i$.

The function cor() is used to compute the correlation between two or more vectors. For example to get the linear correlation between the May and June values of the NAO, type

```{r correlationFunction}
cor(NAO$May, NAO$Jun)
```

The value indicates weak positive correlation. Note that the order of the vectors in the function is irrelevant as $r(x,y)=r(y,x)$. If there are missing values in $x$ or $y$, use the use argument to restrict the computation to complete observations. For example, type

```{r correlationFunction2}
cor(SST$Aug, SST$Sep, use="complete.obs")
```

Here the correlation value indicates strong positive correlation.

The statistic $r$ estimated from the data is a random variable and is thus subject to sampling variation. For instance, adding another year's worth of data will result in a value for $r$ that is somewhat different. Typically your hypothesis is that the population correlation between any two variables is zero. As might be guessed from the differences in $r$ your conclusions about this hypothesis will likely be different for the SST and NAO data.

You can ask the question differently. For example, in 1000 samples of $x$ and $y$ each of size 30 from a population with zero correlation, what is the largest value of $r$? You answer this question using simulations by typing

```{r correlationSimulation}
set.seed(3042)
n = 30
cc = numeric()
for(i in 1:1000){
  x = rnorm(n); y = rnorm(n)
  cc[i] = cor(x, y)
}
mean(cc); max(cc)
```

The variable n sets the sample size and you simulate 1000 different correlation coefficients from different independent samples of x and y.
The average correlation is close to zero as expected, but the maximum correlation is perhaps surprisingly large. High correlation can arise by chance. Thus when you report a correlation coefficient a confidence interval on your estimate or a test of significance should be included.

This is done with the cor.test() function. The test is based on transforming $r$ to a statistic that has a $t$-distribution. Returning to the NAO example, to obtain a confidence interval on the correlation between the May and June values of the NAO and a test of significance, type

```{r correlationTest}
cor.test(NAO$May, NAO$Jun)
```

The output gives a $p$-value as evidence in support of the null hypothesis of no correlation, along with a CI for the correlation estimate. Repeat this example using the January and September values of SST. What is the confidence interval on the correlation estimate? How would you describe the evidence against the null hypothesis of zero correlation in this case?

### Spearman's Rank and Kendall's $\tau$ Correlation

An alternative is Spearman's rank ($\rho$) correlation, which overcomes the effect of outliers and skewness by considering the rank of the data rather than the magnitude. The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables.

The Pearson correlation is the default in the cor.test() function. You change this with the method argument. To obtain Spearman's rank correlation and the associated test of significance, type

```{r SpearmanRankCorrelation}
cor.test(H$G, H$FL, method="spearman")
```

The results provide suggestive evidence against the null hypothesis of zero correlation.

Another correlation is Kendall's $\tau$, which is based on counting the number of concordant and discordant point pairs from your data. For two data vectors $x$ and $y$ each of length $n$, a point at location $i$ is given in two-dimensional space as ($x_i, y_i$). A point pair is defined as [$(x_i, y_i); (x_j, y_j)$] for $i \neq j$. A point pair is concordant if the difference in the $x$ values is of the same sign as the difference in the $y$ values, otherwise it is discordant. The value of Kendall's $\tau$ is the number of concordant pairs minus the number of discordant pairs divided by the total number of unique point pairs, which is $n(n-1)/2$ where $n$ is the sample size. For a perfect correlation, either all point pairs are concordant or all pairs are discordant. Under zero correlation there are as many concordant pairs as discordant pairs.

Repeat the call to function cor.test() on coastal hurricane activity, but now use the kendall method.

```{r KendallCorTest}
cor.test(H$G, H$FL, method="kendall")
```

### Bootstrap Confidence Intervals

Kendall's $\tau$ and Spearman's rank correlations do not come with confidence intervals. You should always report a confidence interval. In this case you use a procedure called 'bootstrapping', which is a resampling technique. The idea is to sample the values from your data with replacement using the sample() function. The sample size is the same as the size of your original data. The sample is called a bootstrap replicate. You then compute the statistic of interest from your replicate. The bootstrap value will be different from the value computed from your data because the replicate contains repeats and not all data are included. You repeat the procedure many times collecting all the bootstrap values.
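As a concrete sketch of this resampling loop (the choice of 1000 replicates and the seed are arbitrary), you could collect bootstrap values of Kendall's $\tau$ for the Gulf and Florida counts as follows.

```{r bootstrapByHand}
set.seed(3042)
tau = numeric()
for(i in 1:1000){
  rows = sample(1:nrow(H), replace=TRUE)                 # one bootstrap replicate of the rows
  tau[i] = cor(H$G[rows], H$FL[rows], method="kendall")  # the statistic from the replicate
}
```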
You then use the quantile() function to determine the lower and upper quantiles corresponding to the 0.025 and 0.975 probabilities.

The function boot() from the package **boot** generates bootstrap replicates of any statistic applied to your data. It has options for parametric and non-parametric resampling. To implement bootstrapping for Kendall's $\tau$ you first create a function as follows.

```{r bootFunction}
mybootfun = function(x, i){
  Gbs = x$G[i]
  Fbs = x$FL[i]
  return(cor.test(Gbs, Fbs, method="k")$est)
}
```

Your function has two arguments: the data x and an index vector i. Next you generate 1000 bootstrap samples and calculate the confidence intervals by typing

```{r bootstrap, warning=FALSE}
require(boot)
tau.bs = boot(data=H, statistic=mybootfun, R=1000)
boot.ci(tau.bs, conf=.95)
```

The boot() function must be run prior to running the boot.ci() function. The result is a 95% CI about the estimated $\tau$.

### Causation

A significant correlation between ocean warmth and hurricane activity does not prove causality. Moreover since the association is symmetric, it does not say that $x$ causes $y$ any more than it says $y$ causes $x$. This is why you frequently hear 'correlation does not equal causation.' The problem with this adage is that it ignores the fact that correlation is needed for causation. It is necessary, but insufficient. When correlation is properly interpreted it is indispensable in the study of hurricanes and climate.

Your correlation results are more meaningful if you explain how the variables are physically related. Several different studies showing a consistent correlation between two variables using different time and space scales, and over different time periods and different regions, provide greater evidence of an association than a single study. However, if you want proof that a single factor causes hurricanes, then correlation is not enough.

Linear Regression
-----------------

Correlation is the most widely used statistic in climatology, but linear regression is arguably the most important statistical model. When you say that variable $x$ has a linear relationship to variable $y$ you mean $y = a + b x$, where $a$ is the $y$-intercept and $b$ is the slope of the line. You call $x$ the independent variable and $y$ the dependent variable because the value of $y$ depends on the value of $x$. But in statistics you don't assume these variables have a perfect linear relationship. Instead, in describing the relationship between two random vectors $x_i$ and $y_i$, you add an error term ($\varepsilon$) to the equation such that
$$
y_i = \alpha + \beta x_i + \varepsilon_i
$$
You assume the values $\varepsilon_i$ are iid $N(0, \sigma^2)$.

The slope of the line is the regression coefficient $\beta$, which is the increase in the *average* value of $y$ per unit change in $x$. The line intersects the $y$-axis at the *intercept* $\alpha$. The vector $x$ is called the explanatory variable and the vector $y$ is called the response variable. The equation describes a regression of the variable $y$ *onto* the variable $x$. This is always the case. You regress your response variable onto your explanatory variable(s), where the word 'regression' refers to a model for the mean of the response variable. The model consists of three parameters $\alpha$, $\beta$, and $\sigma^2$. For a set of explanatory and response values, the parameters are estimated using the *method of least squares*.
The method finds a set of $\alpha$ and $\beta$ values that minimize the sum of squared residuals given as
$$
\hbox{SS}_{res} = \sum_i [y_i - (\alpha+\beta x_i)]^2
$$
The solution to this minimization is a set of equations given by
$$
\hat \beta = \frac{\sum(x_i - \bar x) (y_i - \bar y)}{\sum (x_i - \bar x)^2} \\
\hat \alpha = \bar y - \hat \beta \bar x
$$
that define estimates for $\alpha$ and $\beta$. The residual variance is $\hbox{SS}_{res} / (n-2)$, where $\hat \alpha$ and $\hat \beta$ are substituted into the expression for $\hbox{SS}_{res}$ above. The regression line is written as
$$
\hat y_i = \hat \alpha + \hat \beta x_i
$$

The method of least squares to determine the $\hat \alpha$ and $\hat \beta$ values is performed by the function lm() (linear model). If you are interested in regressing the August values of Atlantic SST onto the preceding January SST values, you type

```{r lrAugJanSST}
lm(SST$Aug ~ SST$Jan)
```

The argument to lm() is a model formula where the tilde symbol, as we've seen, is read as 'described by.' Or to be more complete in this particular case, we state that the August SST values are *described by* the January SST values *using a linear regression model*. In this case you have a single explanatory variable so the formula is simply y~x and you call the model 'simple linear regression'.

The response variable is the variable you are interested in modeling. You must decide which variable is your response variable before beginning. Unlike correlation, a regression of $y$ onto $x$ is not the same as a regression of $x$ onto $y$. Your choice depends on the question you want answered. You get no guidance by examining your data nor will R tell you. Here you choose August SST as the response since it is natural to consider using information from an earlier month to predict what might happen in a later month.

Parameter estimates are given in the table of coefficients. The estimated intercept value ($\hat \alpha$) is given under Intercept and the estimated slope value ($\hat \beta$) under SST$Jan. The units on the intercept parameter are the same as the units of the response variable, here C. The units on the slope parameter are the units of the response divided by the units of the explanatory variable, here C per C. Thus you interpret the slope value in this example as follows: for every 1C increase in January SST, the August SST increases by `r I(round(lm(SST$Aug ~ SST$Jan)$coefficients[2],2))`C on average.

The slope and intercept values will deviate somewhat from the true values due to sampling variation. One way to examine how much deviation to expect is to take many samples from the data and, with each sample, use the lm() function to determine the parameter. The code below does this for the slope parameter using January SST values as the explanatory variable and August SST values as the response.

```{r simulateSlope}
sl = numeric()
for (i in 1:1000) {
  id = sample(1:length(SST$Aug), replace=TRUE)
  sl[i] = lm(SST$Aug[id] ~ SST$Jan[id])$coef[2]
}
round(quantile(sl), digits=2)
```

You sample from the set of row indices and use the same index for the January and the August values. Results indicate that 50% of the slopes fall between the values `r I(round(quantile(sl)[2],2))` and `r I(round(quantile(sl)[4],2))`C per C.

Although illustrative, sampling is not needed. Recall you calculated the s.e.($\bar x$) from a single sample to describe the variability of the sample mean. You do the same to calculate the standard error of the slope (and intercept) from a sample of $x$ and $y$ values.
These standard errors, denoted s.e.($\hat \beta$) and s.e.($\hat \alpha$), are used for inference and to compute confidence intervals. Typically the key inference is a test of the null hypothesis that the population value for the slope is zero. This implies the line is horizontal and there is no relationship between the response and the explanatory variable. The test statistic in this case is
$$
t = \frac{\hat \beta}{\hbox{s.e.}(\hat \beta)}
$$
which follows a $t$ distribution with $n-2$ degrees of freedom if the true slope is zero. Similarly you can test the null hypothesis that the intercept is zero, but this often has little physical meaning because it typically involves an extrapolation outside the range of your $x$ values.

The value for the test statistic ($t$ value) is not provided as part of the raw output from lm(). The result is a model object. This is a key concept. In the R tutorial chapter you encountered data objects. You created structured data vectors and input data frames from a spreadsheet. The saved objects are listed in your working session by typing objects(). Functions, like table(), are applied to these objects to extract information. In the same way, a model object contains a lot more information than what is printed.

An important extractor function is summary(). You saw previously that applied to a data frame object, the function extracts statistics about the values in each column. When applied to a model object it extracts information about the model. For instance, to obtain the statistics from the regression model of August SST onto January SST, type

```{r summaryLinearModel}
summary(lm(SST$Aug ~ SST$Jan))
```

The output is grouped by the call function, summary of residuals, table of coefficients, residual standard error, $R^2$ values, and the $F$-statistic and associated $p$-value. The output starts with a repeat of the function call. This is useful if the object is saved and examined later.

Summary statistics on the set of model residuals follow. A residual is the difference between the response value at a particular explanatory value and the modeled value. The distribution of residuals is an important diagnostic in evaluating how well the model fits your data. The average of the residuals is zero by definition of least squares, so the median should not be far from zero and the minimum and maximum should roughly be equal in absolute magnitude. Similarly for the first (1Q) and third (3Q) quartile values.

Next, the table of coefficients shows the intercept and slope values as in the raw output, but here the accompanying standard errors, $t$ values, and $p$-values are also provided. The output is in tabular form with each row corresponding to a separate parameter. The first row is the intercept parameter and the second row is the slope parameter associated with the explanatory variable. Don't be confused about this labeling. The vector SST$Jan is the explanatory variable and the coefficient value is the amount of change in the mean response associated with a unit change in the explanatory variable.

The first labeled column is the sample estimate (Estimate), the second is the standard error (Std. Error), the third is the $t$ value (t value) and the fourth is the $p$-value (Pr(>|t|)). Note that the $t$ value is the ratio of the estimated value to its standard error. The $p$-value is the probability of finding a $t$-value as large or larger (in absolute value) by chance assuming the true coefficient is zero.
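You can verify these relationships directly from the coefficient table. The sketch below (the object names m and tab are my own) extracts the table, reproduces the $t$ value as the ratio of the estimate to its standard error, and recovers the $p$-value from the $t$-distribution.

```{r coefficientTableCheck}
m = lm(Aug ~ Jan, data=SST)
tab = summary(m)$coefficients
tab["Jan", "Estimate"]/tab["Jan", "Std. Error"]        # reproduces the t value
2 * pt(-abs(tab["Jan", "t value"]), df=m$df.residual)  # reproduces Pr(>|t|)
```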
Here the $p$-value on the January SST coefficient is near zero, providing convincing evidence against the null hypothesis of no relationship between January SST and August SST given your sample of data. By default, symbols are placed to the right of the $p$-values as indicators of the level of significance and a line below the table provides the definition. Here we turned them off using an argument in the options() function. In your scientific reporting the $p$-value itself should always be reported rather than a categorical significance level.

The interpretation of a $p$-value as evidence in support of the null hypothesis is the same as you encountered earlier. Your job is simply to determine the null hypothesis. In the context of regression, the assumption of no relationship between the explanatory and response variable is typically your null hypothesis. Therefore, a low $p$-value indicates evidence of a relationship between your explanatory and response variables.

Continuing with the extracted summary of the model object, the residual standard error quantifies the variation of the observed values about the regression line. It is computed as the square root of the sum of the squared residuals divided by the square root of the degrees of freedom. The degrees of freedom is the sample size minus the number of coefficients. It provides an estimate of the model parameter $\sigma$.

Next are the R-squared values. The 'multiple R squared' is the proportion of variation in the response variable that can be explained by the explanatory variable. So here you state that the model explains a percentage of the variation in August SST values. With only a single explanatory variable (simple linear regression) the multiple R squared is equal to the square of the Pearson correlation coefficient.

The adjusted R squared ($\bar R^2$) is a modification to the multiple R squared for the number of explanatory variables. The adjusted R squared increases only if the new variable improves the model more than would be expected by chance. It can be negative, and will always be less than or equal to the multiple R squared. It is defined as
$$
\bar R^2 = 1-(1-R^{2}){n-1 \over n-p-1}
$$
where $n$ is the sample size and $p$ is the number of explanatory variables. In small samples with many explanatory variables, the difference between $R^2$ and $\bar R^2$ will be large.

The final bit of output is related to an $F$ test, which is a test concerning the entire model. The output includes the $F$ statistic, the degrees of freedom (in this case two of them) and the corresponding $p$-value as evidence in support of the null hypothesis that the model has no explanatory power. In the case of simple regression, it is equivalent to the test on the slope parameter so it is only interesting when there is more than one explanatory variable. Note that the $F$ statistic is equal to the square of the $t$ statistic, which is true of any linear regression model with one explanatory variable.

The function resid() takes a model object and extracts the vector of residual values. For example, type

```{r saveLinearRegressionModel}
lrm = lm(Aug ~ Jan, data=SST)
resid(lrm)[1:10]
```

First the model object is saved with name lrm. Here only the column names are referenced in the model formula because you specify the data frame with the data argument. Then the extractor function resid() lists the residuals. Here using the subset operator you list only the first ten residuals. Similarly the function fitted() computes the mean response value for each value of the explanatory variable.
Similarly, the function fitted() computes the mean response value for each value of the explanatory variable. For example, type

```{r fittedValues}
fitted(lrm)[1:10]
```

These fitted values lie along the regression line. Note that the residuals and fitted values are labeled with the row numbers of the SST data frame. In particular, note that they do not contain rows 1 through 5, which have missing values in the response and explanatory variable columns.

A useful application for a statistical model is predicting for new values of the explanatory variable. The predict() function is similar to the fitted() function, but allows you to predict values of the response for arbitrary values of the explanatory variable. The caveat is that you need to specify the explanatory values as a data frame using the newdata argument. For example, to make an SST prediction for August given a value of 19.4C in January, type

```{r predictionLinearRegressionModel}
predict(lrm, newdata=data.frame(Jan=19.4))
```

The word 'predictor' is the generic term for an explanatory variable in a statistical model. A further distinction is sometimes made between covariates, which are continuous-valued predictors, and factors, which can take on only a few values that may or may not be ordered.

A prediction is not worth much without an estimate of uncertainty. A predicted value from a statistical model has two sources of uncertainty. One is the uncertainty about the mean of the response *conditional* on the value of the explanatory variable. Like the standard error of the mean, it is the precision with which the conditional mean is known. It is known as a *confidence interval*. To obtain the confidence interval on the predicted value, type

```{r confidenceIntervalPrediction}
predict(lrm, data.frame(Jan=19.4), int="c")
```

The argument int="c" tells the extractor function predict() to provide a confidence interval on the predicted value. The output includes the predicted value in the column labeled fit and the lower and upper confidence limits in the columns lwr and upr, respectively. By default the limits define the 95% CI. This can be changed with the level argument. Given the data and the model, there is a 95% chance that the interval defined by the limits will cover the true (population) mean when the January SST value is 19.4C.

The other source of uncertainty arises from the distribution of a particular value given the conditional mean. That is, even if you know the conditional mean exactly, the distribution of particular values about the mean will have a spread. The *prediction interval* provides a bound on a set of new values from the model that contains both sources of uncertainty. As a consequence, for a given confidence level, the prediction interval will always be wider than the confidence interval. The prediction interval relies on the assumption of normally distributed errors with a constant variance across the values of the explanatory variable. To obtain the prediction interval on the predicted value, type

```{r predictionInterval}
predict(lrm, data.frame(Jan=19.4), int="p")
```

Given the data and the model, there is a 95% chance that the interval defined by these limits will cover a future value of August SST given that the January SST value is 19.4C.
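To see how the level argument works and to compare the two intervals side by side, you can request both at the 90% level for a few January SST values. The chunk below is a sketch; the data frame name newx and the particular January values are our choices.

```{r intervalLevels}
# 90% confidence and prediction intervals for three January SST values
newx = data.frame(Jan=c(19, 19.4, 19.8))
predict(lrm, newdata=newx, interval="confidence", level=.9)
predict(lrm, newdata=newx, interval="prediction", level=.9)
```

At every value of January SST the prediction interval is wider than the corresponding confidence interval.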
Multiple Linear Regression
--------------------------

Multiple regression extends simple linear regression by allowing more than one explanatory variable. Everything from simple regression carries over. Each additional explanatory variable contributes a new term to the model. However, an issue now arises because of possible relationships between the explanatory variables.

We continue with a model for predicting August SST values over the North Atlantic using SST values from earlier months. Specifically, for this example you are interested in making predictions with the model at the end of March. You have January, February, and March SST values plus Year as the set of explanatory variables. The first step is to plot your response and explanatory variables. This is done with the pairs() function. By including the panel.smooth() function as the panel argument, a local smoother is drawn through the set of points, which makes it easier to see possible relationships. Here you specify the August values (column 9 in SST) to be plotted in the first row (and column), followed by Year and then the January through March values.

```{r pairsFunction}
labs = c("Aug SST [C]", "Year", "Jan SST [C]", "Feb SST [C]", "Mar SST [C]")
pairs(SST[, c(9, 1:4)], panel=panel.smooth, labels=labs)
```

The scatter plots are arranged in a two-dimensional matrix. The response variable is August SST and the four explanatory variables are Year and the SST values during January, February, and March. A locally-weighted polynomial smoother with a span of 67% of the points is used to draw the red lines. The diagonal elements of the matrix are the variable labels. The plot in the first row and second column shows the August SST values on the vertical axis and the year on the horizontal axis. The plot in row one, column three shows the August SST values on the vertical axis and the January SST values on the horizontal axis, and so on.

The plots draw your attention to which explanatory variables might be important in a model. Here all relationships are positive. Specifically, August SST values increase with increasing year and with increasing January through March SST values. Based on these bivariate relationships you might expect that all of the explanatory variables, with the possible exception of Year, will be important in the model of August SST. The plots also show the relationships between the covariates. Here you see a tight linear relationship between each month's SST values. This warrants attention, as a model that includes all three monthly SST values will contain a large amount of redundant information. The information contained in the February SST values is largely the same as that in the January SST values, and the March SST values largely duplicate the February SST values.

A similar set of plots is available using the ggpairs() function from the **GGally** package.

```{r ggPairs, warning=FALSE, message=FALSE}
require(GGally)
ggpairs(SST[SST$Year > 1855, ], columns = c("Year", "Jan", "Feb", "Mar", "Aug"))
```

To fit a multiple regression model to these data with August SST as the response variable, type

```{r multipleLinearRegression}
m1 = lm(Aug ~ Year + Jan + Feb + Mar, data=SST)
```

Then to examine the model coefficients, type

```{r summaryMultipleLinearRegression}
summary(m1)
```

As expected, March SST is positively related to August SST, and significantly so. Year is also positively related to August SST. The Year term has a $p$-value on its coefficient that is marginally significant (suggestive, but inconclusive) against the null hypothesis of a zero trend. However, you can see that the coefficient on January SST is positive but not statistically significant, and the coefficient on February SST is negative.
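Before interpreting these signs, it helps to put numbers on the bivariate relationships seen in the pairs plot. The chunk below is a sketch that computes the matrix of pairwise correlations among the response and the candidate explanatory variables; the use argument handles the years with missing values.

```{r explanatoryCorrelations}
# Pairwise correlations using all available (pairwise complete) years
cor(SST[, c("Aug", "Year", "Jan", "Feb", "Mar")], use="pairwise.complete.obs")
```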
You can see there is a strong *positive* relationship between February SST and August SST, so the fact that the relationship is negative in the context of multiple regression indicates a problem. The problem stems from the correlation between the explanatory variables (multicollinearity). Correlations between explanatory variables higher than about 0.6 can result in an unstable model because the standard errors on the coefficients become inflated, so the coefficients are not estimated with enough precision. As long as the multicollinearity is not perfect, estimation of the regression coefficients is possible, but the estimates and their standard errors become sensitive to even the slightest change in the data. Prior understanding of the partial correlation (here the correlation between February SST and August SST controlling for March SST) may help argue in favor of retaining two highly-correlated explanatory variables, but in the usual case it is better to eliminate the variable whose relationship with the response variable is harder to explain physically or that has the smaller correlation with the response variable. Here February SST has a smaller correlation with August SST, so you remove it and reestimate the coefficients. You create a new linear model object and summarize it by typing

```{r multipleLinearRegressionModel2}
m2 = lm(Aug ~ Year + Jan + Mar, data=SST)
summary(m2)
```

The remaining explanatory variables all have a positive relationship with the response variable, consistent with their bivariate plots, but the coefficient on January SST is not statistically significant. Thus you try a third model with this term removed.

```{r multipleLinearRegressionModel3}
m3 = lm(Aug ~ Year + Mar, data=SST)
summary(m3)
```

The model makes sense. August SST values are higher when March SST values are higher and vice versa. This relationship holds after accounting for the upward trend in August SST values. Note that the order of the explanatory variables on the right side of the ~ does not matter. That is, you get the same output by typing

```{r orderDoesNotMatter}
summary(lm(Aug ~ Mar + Year, data=SST))
```

The multiple R-squared value is slightly lower in the final model with fewer variables. This is always the case: R squared cannot be used to make meaningful comparisons of models with different numbers of explanatory variables. The adjusted R squared can be used to make such comparisons because it increases when a term is added only if that term improves the model by more than would be expected by chance.

The final model is checked for adequacy by examining the distribution of model residuals. The five-number summary of the residuals given as part of the summary output gives you no reason to question the assumption of normally distributed residuals. However, the residuals are likely to have some autocorrelation, violating the assumption of independence.
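Two quick graphical checks of these assumptions are a normal quantile plot of the residuals and their autocorrelation function; substantial autocorrelation at short lags would point to a violation of independence. The chunk below is a sketch using the final model object m3.

```{r residualDiagnostics}
# Normal quantile plot of the residuals with a reference line
qqnorm(resid(m3)); qqline(resid(m3))
# Autocorrelation function of the residuals (year-to-year dependence)
acf(resid(m3))
```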
### Choosing Predictors

Suppose $H$ is your variable of interest and $X_1, \ldots, X_p$ is a set of potential explanatory variables (predictors), each a vector of $n$ observations. The problem of predictor selection arises when you want to model the relationship between $H$ and a subset of $X_1, \ldots, X_p$, but there is uncertainty about which subset to use. The situation is particularly interesting when $p$ is large and $X_1, \ldots, X_p$ contains redundant and irrelevant variables. You want a model that fits the data well and that produces forecasts with small variability. The problem is that these two goals are in conflict. An additional predictor in a model will improve the fit (reduce the bias), but will increase the variance due to a loss in the number of degrees of freedom. This is an example of the bias-variance trade-off. The Akaike Information Criterion (AIC) is a statistic that helps you negotiate this trade-off. It is given by
$$
\mathrm{AIC} = 2(p+1) + n\log(\mathrm{SSE}/n),
$$
where $p$ is the number of predictors and SSE is the residual sum of squares. You can compare the AIC values when each predictor is added to or removed from a given model. For example, if the AIC value increases after a predictor is added, the trade-off favors keeping the extra degree of freedom and dropping the predictor.

Returning to your original model for August SST, the model is saved in the object m1. The drop1() function takes your regression model object and returns a table showing the effects of dropping, in turn, each variable from the model. To see this, type

```{r dropOneFunction}
drop1(m1)
```

Here the full model (all four covariates) has a residual sum of squares (RSS) of `r I(round(drop1(m1)$RSS[1], 2))` (in the first row, where no variable is dropped). If you drop the Year variable, the RSS increases to `r I(round(drop1(m1)$RSS[2], 2))` and you gain one degree of freedom. That is too much of an increase in RSS for the gain of only a single degree of freedom, so the AIC increases to `r I(round(drop1(m1)$AIC[2], 2))`. You conclude Year is too important to drop from the model. The same is true of March SST, but not of January or February SST. Therefore, to help you choose variables, you compare the AIC value with each variable dropped against the AIC value for the full model. If the AIC value is less than the AIC for the full model then the trade-off favors removing the variable from the model. If you repeat the procedure after removing the January and February SST variables you will conclude that there is no statistical reason to make the model simpler.

Stepwise regression automates this use of drop1(). It can also be done using forward selection. The AIC is used as the criterion for choosing or deleting a variable and for stopping. To see how this works with your model, type

```{r stepFunction}
step(m1)
```

The output is a series of tables showing the RSS and AIC values with successive variables removed. The default method is backward deletion, which amounts to a successive application of the drop1() function. It's a good strategy to try the other selection methods to see if the results are the same. They may not be. If they are, you will have greater confidence in your final model.

### Cross-Validation

Cross-validation is needed if your statistical model will be used to make actual forecasts. Cross-validation is a procedure to assess how well your scheme will do in forecasting the unknown future. In the case of independent hurricane seasons, it involves withholding a season's worth of data, developing the algorithm on the remaining data, then using the algorithm to predict data from the season that was withheld. Note that if your algorithm involves stepwise regression or machine learning, then the predictor-selection component must be part of the cross-validation. That is, after removing a season's worth of data, you must run your selection procedure and then make a single-season prediction using the final model(s). This needs to be done for each season removed. The result of cross-validation is an estimate of the out-of-sample error, which approximates the average forecast error you can expect when the model is used to predict the future.
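The chunk below is a minimal sketch of this procedure for the final August SST model. For brevity the predictor set (Year and March SST) is held fixed rather than re-selected with each season withheld, so the error estimate is somewhat optimistic relative to the full procedure described above. The long-run mean (climatology) of the remaining years is included as a simple reference forecast; the object names cc, predCV, and climCV are ours.

```{r leaveOneOutCV}
# Leave-one-season-out cross-validation for the model Aug ~ Year + Mar
cc = SST[complete.cases(SST[, c("Aug", "Year", "Mar")]), ]
n = nrow(cc)
predCV = climCV = numeric(n)
for(i in 1:n){
  fit = lm(Aug ~ Year + Mar, data=cc[-i, ])    # refit without season i
  predCV[i] = predict(fit, newdata=cc[i, ])    # predict the withheld season
  climCV[i] = mean(cc$Aug[-i])                 # climatology of the remaining seasons
}
sqrt(mean((cc$Aug - predCV)^2))   # cross-validated RMSE, regression model
sqrt(mean((cc$Aug - climCV)^2))   # cross-validated RMSE, climatology
```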
This out-of-sample prediction error is then compared with the out-of-sample prediction error from a simpler model. If the error is larger with the simpler model, then your model is considered skillful. Note that the prediction error from the simpler model, even if it is long-run climatology, also needs to be obtained using cross-validation.

Here we reviewed classical statistics with examples from hurricane climatology. Topics included descriptive statistics, probability and distributions, one- and two-sample tests, statistical formulas in R, correlation, and regression.