Use R for Climate
James B. Elsner & Thomas H. Jagger
This tutorial gives you an introduction to the R programming language for
statistical analysis and modeling. Examples are taken from our research on
hurricane climate. It was developed
with support from the Florida State University, the U.S. National Science
Foundation, and the Risk Prediction Initiative of the Bermuda Institute of Ocean
Sciences. You should follow along with an open R session.
The key strokes are given in red font.
If you're not a fast typist, they can be copied from the pdf (not from DVD) and
pasted into your R session. Output from an R command is given in Courier
New font. Questions for which answers are sought and practice problems are
given in green font.
The tutorial is divided into 7 sections. We begin with some motivations
for giving R a try and some information on how to download and install R on your
computer. We show you how to use R as a calculator. In section 2 we
show you how to enter data into R, how to query your data, and how to obtain
some statistical information about your data. In section 3 we
provide some examples of using R for making inferences from correlations and
differences in sample means. In section 4 we provide an introduction to
using R for graphing your data. In section 5 we show you how R can be used
to simulate various probability models and how to do bootstrap analysis.
In section 6 we provide more advanced material on using R for statistical
modeling. We consider linear regression, logistic regression, Poisson
regression, quantile regression and regression trees. In section 7
we list additional resource material.
Benefits and drawbacks of R
Free, open-source, runs on Unix, Linux, Windows, and Macs. Built-in help
system. Excellent graphing capabilities. Powerful, extensible, and easy to
learn syntax. Thousands of built-in functions.
Limited graphical user interface meaning it is harder to learn at the outset. No
commercial support. It's a programming language so you need to appreciate syntax
issues. Common metaphors for working with computers are: browsers, iTunes, and
perhaps Excel. R is nothing like these. Irregular users forget commands. There
are no visual cues: A blank screen is intimidating.
Why switch to R?
If you've already mastered a statistical package and if you only carry out a
limited range of statistical tests with no intention of doing something
different, then don't switch. The main reason for switching to R is to
take advantage of its unrivaled coverage and availability of new, cutting edge
applications in emerging statistical fields such as generalized mixed effects
and generalized additive models. Another reason is to be able to
understand the literature. More people are reporting results in the
context of R, and it is important to know what they're talking about. Many
of the world's leading statisticians use and contribute to R.
"...the massive collective international effort represented by the R project
exceeds the most idealistic Marxist imagination." --- Roger Koenker
What is R?
R is an open-source statistical environment modeled after S and S-Plus
(http://www.insightful.com). The S language was developed in the late 1980s at
AT&T labs. The R project was started by Robert Gentleman and Ross Ihaka of
the Statistics Department of the University of Auckland in 1995. It now has a
large audience. It's currently maintained by the R core-development team; an
international group of volunteer developers.
To get to the R project web site, open a browser and in a search window, type
the key words "R project".
The R project web page is http://www.r-project.org/.
Directions for obtaining the software, accompanying packages and other sources
of documentation are provided at the site.
Downloading and starting R
Click on the above URL (uniform resource locator) and browse to the
site. Click on CRAN (Comprehensive R Archive Network) to download the
latest version of R. Scroll to the nearest mirror site. Click on the
appropriate pre-compiled binary distribution. Click on the base
distribution to download. Only the base directory is needed at this time.
On your computer, click on the download icon and follow the instructions to install R.
Once installed, click on the icon to start R. To open R in Linux, from a
command window, type R.
R is most easily used in an interactive manner. You ask it a question and it
gives you an answer. Questions are asked and answered on the command line.
The > (greater than symbol) is used as the
prompt. In what follows, it is the one character that you do NOT type.
If a command is too long, a
+ (plus symbol) is used for the
continuation prompt. If you get completely lost on a command, you can use
Ctrl c or
Esc to get the prompt back and to start over.
To find out how to cite R in your publications, type:
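The command for this is:
citation()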
What appears on the screen is a reference citation for R and a BibTeX entry for LaTeX users.
Most commands are functions and most functions require parentheses. If the
function has all default options, then the function is applied by typing two
side by side parentheses.
R evaluates commands typed at the prompt. It returns the result of the
computation on the screen or saves it to an object. For example, to find
the sum of the square root of 25 and 2, type:
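For example:
sqrt(25) + 2
[1] 7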
The [1] says "first requested element will follow". Here there is only one
element, the answer 7. The > prompt that follows indicates R is ready
for another command.
How would you calculate the 5th power of
How about the amount of interest on $1000, compounded annually at 4.5% a.p. for
The help manuals are available as html. To access them, type:
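The command for this is:
help.start()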
To quit R, type
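q()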
Statistics is the analysis and modeling of data. The most
useful R command for quickly inputting small amounts of data is the
c function. This function combines
(concatenates) items together. As an example, consider a set of
hypothetical annual land falling hurricane counts over a ten-year period.
2 3 0 3 1 0 0 1 2 1
To enter these into an R session, type (note: do not type the prompt, only the command):
counts = c(2,3,0,3,1,0,0,1,2,1)
counts
 [1] 2 3 0 3 1 0 0 1 2 1
Notice a few things. You assigned the values to an object called
counts. The assignment
operator is an equal sign (=). Another
valid assignment operator is the arrow
(<-). I think it is easier to stick
with the equal sign. Values do not automatically print. They are assigned
to an object name. They can be printed by typing the object name as
we did on the second line. Finally, the values when printed are prefaced
with a [1]. This indicates that
the object is a vector and the first entry in the vector is a value of 2
(the number immediately to the right of
[1]). More on this later.
You can save yourself a lot of typing if you learn that the arrow keys can be
used to retrieve previous commands. Each command is stored in a history
file and the up arrow key will move backwards through the history file and the
down arrow forwards. The left and right arrow keys will work as expected.
You can also enter small amounts of data with the
scan function. You enter data values
one at a time, line by line. When finished, enter a blank line.
counts2 = scan()
R comes with many built in functions that you can apply to your
counts data. One of them is the
mean function for finding the average.
To use it, type:
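For the counts entered above:
mean(counts)
[1] 1.3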
As well, you could apply the functions median
and var to find the median and variance,
respectively. The syntax is the same – the function name followed by parentheses
to contain the argument(s). Most functions have more than one
argument. Arguments not specified are given default values.
The data is stored in R as a vector. This means that R keeps track of the
order that the data were entered. There is a first element, a second
element, and so on. This is good for several reasons.
Our simple data vector of counts has a natural order - year 1, year 2,
etc. We do not want to mix these up.
We would like to be able to make changes to the data item by item instead of
entering the entire data again.
Vectors are math objects so that math operations can be performed on them.
Let's see how these concepts apply to our landfall counts example. First,
suppose counts contain the annual landfall
count from the first decade of a longer record. We want to keep track of
counts over other decades. This could be done by the following, rather
verbose, commands:
counts.decade1 = counts # make a copy
counts.decade2 = c(0,5,4,2,3,0,3,3,2,1) # enter a new set of counts
The pound symbol (#) is used as a comment
character. Anything after the # is
ignored. Adding comments to the history file is a good way of recalling
what you did and why.
Note the two different object names. Unlike many programming
languages, the period in R is only used as punctuation. You can't use an
_ (underscore) to punctuate names as the
underscore is actually another assignment operator although its use is fading.
Now suppose the U.S. National Hurricane Center (NHC) reanalyzes a storm, and
that the 6th year of the 2nd decade is a 1 rather than a 0 for the number of
landfalls. In this case we can type:
counts.decade2[6] = 1 # assign the
6th year of the decade a value of 1 landfall
The assignment to the 6th entry in the vector
counts.decade2 is done by referencing the 6th
entry of the vector with square brackets
[]. It is very important to keep this
in mind: Parentheses () are used for
functions and square brackets [] are used for
vectors (and arrays, lists, etc).
counts.decade2 # print out the entire vector
 [1] 0 5 4 2 3 1 3 3 2 1
counts.decade2[2] # print the number of
landfalls during year 2 of the second decade
counts.decade2[4] # 4th year
counts.decade2[-4] # all but the 4th year
 [1] 0 5 4 3 1 3 3 2 1
counts.decade2[c(1,3,5,7,9)] # print the
counts from the odd years
 [1] 0 4 3 3 2
R lets us systematically examine our data. For example, to find the
maximum number of landfalls in the first decade, type:
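Using the copy made above:
max(counts.decade1)
[1] 3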
Which years had the maximum?
counts.decade1 == 3
 [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Notice the usage of double equals signs (==).
This tests all the values of counts.decade1
to see if they are equal to 3. The 2nd and 4th answer yes (TRUE) the others no.
Think of this as asking R a question. Is the value equal to 3? R answers all at
once with a long vector of TRUE's and FALSE's. Now the question is – how can we
get the years corresponding to the TRUE values? Let’s rephrase, which years have
3 landfalls? If you guessed that the command
which will work, you're on your way to mastering R.
which(counts.decade1 == 3)
The function which.max can be used to get the first maximum.
We might also want to know the total number of landfalls in each decade or the
number of years in a decade without a landfall. Or how about the ratio of
the mean number of landfalls over the two decades.
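For example, the decade totals and the ratio of the decadal means can be found with:
sum(counts.decade1); sum(counts.decade2) # total landfalls in each decade
mean(counts.decade2)/mean(counts.decade1) # ratio of the decade means
[1] 1.846154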
Here we apply two functions on a single line by using the semicolon
sum(counts.decade1 == 0); sum(counts.decade2 == 0)
So there are 85% more landfalls during the second decade. Is this significant?
Before moving on, it is good practice to clean up your workspace.
This is done using the rm (remove) function.
Most of your analyses will be done on data that are read into R. First,
determine what directory you are working in by typing:
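getwd()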
You can change your working directory by using the
setwd() function and specifying the path to the new directory in quotes.
Second, download the data (Save Link or Target As) into your working
directory. The data are available
(http://garnet.fsu.edu/~jelsner/www/data.html) under hurricane landfall counts.
Third, a simple way to read data is to use the read.table function.
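A sketch of the call (the file name here is a placeholder for the file you downloaded):
ushur = read.table("UShurricanes.txt", header=T)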
If the file is read in, you should get the > prompt. Note the quotes
around the file name as you are referencing a file outside of R. If the file
is not in your working directory, you can use its full path name. The
header=T argument means there is a header
record in the data (T stands for true).
Data read in with the read.table function are
of class data.frame. To check, type:
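For example:
class(ushur)
[1] "data.frame"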
To see if the data are what you expect, try the following commands.
names(ushur) # list the column names,
US: US hurricanes, MUS: major US hurricanes, G: Gulf coast, FL: Florida, E: East coast
head(ushur) # list the first several rows
tail(ushur) # list the last several rows
read.table has options for specifying the
separator character or characters between columns used in the native file.
Alternatives to the default use of a space are
sep="\t" for tab. You can also control
the choice of a missing value character, which by default is NA. If the
missing value character in the native file is a 999, specify na.strings="999".
There are several variants of read.table that
differ only in having different default parameter settings. Note in
particular read.csv, which has settings that
are suitable for comma delimited (csv) files that have been exported from a spreadsheet.
If you want to change the names of the data frame, type:
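One way to do this (renaming the 4th column) is:
names(ushur)[4] = "GC"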
This changes the 4th column name from G to GC.
Frequently all that's needed is a summary of your data.
To access the columns of a data frame, you give the name of the data frame
followed by the dollar sign ($) followed by
the column name.
What is the median number of U.S. hurricanes over this time period? What
is the variance?
The output carries too many digits. Since there are only about 150 or so
years, the number of significant digits is 2 or 3. This can be changed
using the signif function.
Note that the signif function wraps around
the mean function.
The mean or average number of counts over a time period is called a rate.
How many years are in the data set?
dim(ushur) # dimensions (rows by columns) of
the data frame, which here is the number of years by the number of columns (variables)
So the annual hurricane rate is 1.71 hurricanes per year over the 156-year record.
To find out which year(s) have the maximum, type
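One way, nesting the functions:
ushur$Year[which(ushur$US == max(ushur$US))]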
Note how the command structure allows embedding. It is often best to read
the expression from right to left. The first function on the right is max,
so this returns a number giving the maximum annual count. Then we want the
index(es) for that count, finally we want the year(s) corresponding to those index(es).
Similarly, to find the years with no hurricanes, type:
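ushur$Year[ushur$US == 0]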
 [1] 1853 1862 1863 1864 1868 1872 1884 1889 1890 1892 1895 1902 1905 1907 1914
[16] 1922 1927 1930 1931 1937 1951 1958 1962 1973 1978 1981 1982 1990 1994 2000
[31] 2001 2006
Note the output is indexed by the values inside the brackets on the left-hand side.
sum(ushur$US==0) # How many years have zero hurricanes?
table(ushur$US) # How many years have H hurricanes, for each count H?
Is there a trend in the U.S. landfall counts? One way to get an answer is
to correlate the counts with the year.
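cor(ushur$Year, ushur$US)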
What if someone explains that since you are using count data, you should use the
rank correlation. By default, R provides the Pearson product-moment
correlation coefficient. To change the default, type:
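cor(ushur$Year, ushur$US, method="s")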
method="s" refers to the Spearman rank correlation
3. Simple Inferences
"The modern world, despite a surfeit of
obfuscation, complication, and downright deceit, is not impenetrable, is not
unknowable, and---if the right questions are asked---is even more intriguing
than we think. All it takes is a new way of
looking."---Levitt and Dubner
The correlation is small and not likely significant. To test for significance, type:
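cor.test(ushur$Year, ushur$US)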
Pearson's product-moment correlation
data: ushur$Year and ushur$US
t = -0.0446, df = 154, p-value = 0.9645
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
More information is returned when inference is sought. Here we examine
the significance of the computed correlation as a test on whether there is a
trend over time in U.S. hurricane counts. Thus the null hypothesis is no
trend (zero correlation).
The t statistic (value) is calculated as the ratio of the correlation to the
standard error of the correlation, and this t value is used as a quantile of
a t distribution with N-2 dof. The
corresponding p-value of 0.96 is evidence in support of the null hypothesis on
a scale between 0 and 1. Thus we confidently state that the small negative
correlation is not significant.
This is exactly the answer we get if we perform a linear regression of U.S.
counts on year and test for a significant trend (see Section 5). If you
learned statistics from a textbook, you will recall distribution tables
appeared in the Appendix. With R these values are accessible through
calls to the appropriate functions (e.g., probabilities from a t distribution using the pt function).
Let's get some corresponding climate data. The data are available
(http://garnet.fsu.edu/~jelsner/www/data.html) under hurricane landfall
counts. The data consist of monthly values for three covariates
that have been identified as important in modulating Atlantic hurricanes; the
North Atlantic Oscillation (NAO), the Southern Oscillation (SOI), and the
Atlantic Multidecadal Oscillation (AMO).
# read the data
# compute the hurricane season (Aug-Oct) average
cor(amoAO, ushur$US) # this gives an error; the AMO record begins in 1856
cor(amoAO, ushur$US, use="complete.obs") # compute the correlation over all complete observations
Is the correlation between the AMO and U.S. hurricanes significant?
Suggestive, but inconclusive.
A p value is an estimate of the probability that a particular result, or a
result more extreme than the result observed, could have occurred by chance if
the null hypothesis were true. In short, the p value is a
measure of the credibility of the null hypothesis.
However, the p value is often interpreted as evidence against the null
hypothesis; thus a small p value indicates evidence to reject the null hypothesis.
A rough interpretation of the p value as evidence against the null hypothesis is:
p value range      interpreted as evidence against the null hypothesis
0 - 0.01           convincing
0.01 - 0.05        moderate
0.05 - 0.15        suggestive, but inconclusive
> 0.15             none
Similar to what we did with the amoAO,
create two new objects soiAO and naoMJ for the August through
October SOI and the May through June NAO respectively. Using these objects,
which covariate appears to be most strongly correlated with U.S.
hurricane counts? Note that the relationship with the NAO is a lead
relationship (May-June is before the hurricane season starts) so that it can
be used directly for prediction. Is there any significant correlation
between the covariates?
Two sample tests
Is there evidence of more U.S. hurricanes recently? One way to examine
this question is to divide the record into two samples (e.g., first half
versus second half) and compare the means from both samples. The null
hypothesis in this case is that the sample means are not significantly different.
Before we can carry out a test to compare two sample means, we need to test
whether the sample variances are significantly different. Here we use
Fisher's F test. The test is based on the ratio of variances, which
under the null hypothesis is 1 and has an F distribution.
Since there are 156 years in the record
(length(ushur$US)), we take the first 78
years for the first sample (s1) and the next 78 years for the second sample (s2).
The ratio of variances is:
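Assuming the samples are taken directly from the US column:
s1 = ushur$US[1:78]
s2 = ushur$US[79:156]
var(s1)/var(s2)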
The value is not equal to one, but is it significantly different? To test, type:
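var.test(s1, s2)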
        F test to compare two variances
data:  s1 and s2
F = 1.0813, num df = 77, denom df = 77, p-value = 0.7326
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
sample estimates:
ratio of variances
            1.0813
The F value is the ratio of the variance in sample 1 to the variance in sample
2. Each sample contributes (n-1) degrees of freedom (here, n=78
years). The p-value exceeds 0.15 by a wide margin, thus we have no
evidence that the ratio of variances is different than one.
The 95% confidence interval refers to the ratio of the variances and we see
that the interval includes the value of 1.
Having discounted unequal variance, there are a couple of tests for comparing
two sample means: (1) Student's t test when the samples are independent,
the variances constant, and the values are normally distributed, and (2)
Wilcoxon's rank sum test when the samples are independent, but the values are
not normally distributed.
In the case of count data, the values are not normally distributed, but this
violation is usually not critical, so let's try both tests. The null
hypothesis is that the difference in means is zero.
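Since the variances are not significantly different, we use the pooled-variance version of the test:
t.test(s1, s2, var.equal=T)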
Two Sample t-test
data: s1 and s2
t = 0.1651, df = 154, p-value =
alternative hypothesis: true
difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
Here the t value of 0.1651 is computed as the difference between the two means
divided by the standard error of the difference. In R code
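A sketch of that computation (this simple form of the standard error relies on the two samples having equal size, 78 years each):
(mean(s1) - mean(s2)) / sqrt(var(s1)/78 + var(s2)/78)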
The degrees of freedom (df) are the sum of the sample sizes minus 2.
The p value exceeds 0.15 by a wide margin, thus we have no
evidence that the difference in means is other than zero.
Note that this is a 2-sided p value with the alternative that the difference
in means is not equal to zero. If you are testing whether the mean of
one sample is greater than the mean of the other sample, then you would
include the argument alternative="greater".
Note the default is var.equal=F, so that if
you type t.test(s1,s2) you get an
approximation (Welch) to the degrees of freedom.
The 95% confidence interval is around the difference in means. Since the
interval contains zero, there is no evidence against the null hypothesis.
Sometimes two sample data come from paired observations. In this case
you should always use the paired=T argument.
A non-parametric alternative to the Student's t test when the values are not
normally distributed is the Wilcoxon rank-sum test. The test statistic
(W) is calculated by combining samples and arranging them in order (e.g.,
largest to smallest). A rank is assigned to each value and the ranks are
added for each of the two samples. The significance is assessed based on
the size difference between the cumulative rankings.
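wilcox.test(s1, s2)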
Wilcoxon rank sum test with continuity correction
data: s1 and s2
W = 3084.5, p-value = 0.8783
alternative hypothesis: true location shift is not equal to 0
The function wilcox.test uses a normal
approximation to work out a standard normal value from the rank sum value
(W). From this a p value is used to assess the hypothesis that the two
means are the same.
Here the p value, as with the t test, is much greater than 0.15, thus we have
no evidence that the difference in means is other than zero.
Typically the t test will give the lower p value, so the rank sum test is said
to be conservative. The rank sum test is more appropriate than the t
test when the values are not normal.
4. Graphs and Plots
"It is well and good
to opine or theorize about a subject, as humankind is wont to do, but when
moral posturing is replaced by an honest assessment of the data, the result is
often a new, surprising insight."---Levitt
and Dubner (Freakonomics)
If you do nothing else
with your data you will likely at least make a graph. R has a wide range
of plotting capabilities. It takes time to become a master, but a little
effort goes a long way.
Although the commands are simple and somewhat intuitive, to get a publication
quality figure usually requires tweaking the default settings. This is
true of all plotting software.
Since we need access to the various columns in our data frame, it simplifies
typing if we use the attach and
detach functions. The
attach function makes the columns available
to use in commands without the need to preface the data frame name. For
example, if you type the name of one of the columns (say, US):
Error: object "US" not found
The error tells you that there is no object with that name in your
workspace. To list the objects, type
ls(). Now type:
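attach(ushur)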
This makes things easier. It’s good practice to use the
detach function when you finish with the
data frame to avoid masking objects with the same name from other data frames.
To start, let's look at the U.S. hurricane counts over time as a scatter plot.
The default settings on plot give you a first look, but the plot needs to be
modified before serious study.
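A first look with the default settings:
plot(Year, US)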
plot(Year, US, type="l") # Note:
the letter in quotes is a small "l" for line, not the number one
This is often how you see count data plotted in the hurricane
literature. Ignoring for now the axis labels, what's wrong with
it? To help you with the answer, grab the lower right corner of the plot
window and stretch it to the right. See it?
To avoid this problem, use the type="h" option.
plot(Year, US, type="h", lwd=3) # This
increases the line width from a default of 1 pixel unit to 3
To make it prettier, type:
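One possibility, adding axis labels (the label text is a choice):
plot(Year, US, type="h", lwd=3, xlab="Year", ylab="Number of U.S. Hurricanes")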
Is there any correlation between the covariates? One way to examine this
is to graph a scatterplot.
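One possibility is a scatterplot matrix, assuming the three covariate objects created earlier:
pairs(cbind(naoMJ, soiAO, amoAO))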
What do you conclude?
Another useful data display for discrete data is the histogram. This
provides information on the type of distribution. Knowing what kind of
distribution your data conform to is important for modeling.
Let's start with the simplest thing to do.
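hist(US)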
Not bad, but the location of the abscissa labels
makes it difficult to understand what bar belongs to what hurricane
count. Moreover, since the values are discrete, the years with 0
landfalls are added to the years with 1 landfall. You can see this by
looking at the output from table(US) as we did earlier.
There are many options for improving the histogram. You can see these
options by typing a question mark before the function name and leaving off the parentheses.
?hist # brings up a help menu on the topic
Here is one that works by specifying the break points with the
breaks argument. Note also the use of
the seq function.
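A sketch (the axis labels are a choice):
hist(US, breaks=seq(-0.5, max(US)+0.5, 1), xlab="Number of Hurricanes", main="")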
That's more like it. Now we can see that the
mode is 1 (not 0) and that 2s are more frequent than 0s. The
distribution is said to be "skew to the right" (or positively skew) because
the long tail is on the right-hand side of the histogram.
To overlay a theoretical (model) distribution, type:
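A sketch using a Poisson distribution with the observed mean rate (the choice of model and plotting symbol are assumptions):
xs = 0:max(US)
points(xs, dpois(xs, lambda=mean(US)) * length(US), pch=16)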
The data fits the theoretical distribution quite well. We will look at
these theoretical distributions a bit later.
Years",xlab="Number of Hurricanes")
barplot(table(US)/length(US),ylab="Proportion of Years",xlab="Number of
Note that the command
table(US)/length(US) produces proportions
with the result handed off to barplot to
make a graph. The graph has the same shape as the previous one, but the
height axis is between 0 and 1 as it measures the proportion of years.
To examine the distribution of the covariates together with the distribution
of US hurricanes, type the following sequence of commands:
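One version of the sequence (the particular histograms shown are a choice):
par(mfrow=c(2,2))
hist(US); hist(naoMJ); hist(soiAO); hist(amoAO)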
The par function sets the graphical
parameters. Here they are set by the argument
mfrow, which should be read as "2 by 2
layout". Thus each plot is added to the graph in lexicographical
ordering (left to right, top to bottom).
For continuous variables like the covariates considered here, it is sometimes
better to plot an estimate of the probability density function. The
probability density function (pdf) can be seen as a smoothed version of the
histogram. The probability of observing a value between any two
locations [a, b] is given by the integral of the density function evaluated
between a and b.
The plot of a probability density function first requires a call to the
density function. By default an
appropriate smoothing value (bandwidth) will be chosen as the standard
deviation of the smoothing kernel (window). To examine the pdf for the
default and other bandwidths, type:
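For example (the alternative bandwidth of 0.2 is an arbitrary choice):
plot(density(naoMJ)) # default bandwidth
plot(density(naoMJ, bw=0.2)) # a smaller bandwidth for comparison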
To add the locations of the values, use the rug function.
One purpose of plotting the pdf is to assess whether the data can be assumed
normally distributed. Clearly in the case of the NAO, the data appears
to be normal.
For a better assessment, you can plot the kth smallest NAO value against the
expected value of the kth smallest value out of n in a standard normal
distribution. In this way you would expect the points on the plot to
form a straight line if the data come from a normal distribution with
any mean and standard deviation. The function for doing
this is called qqnorm.
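For example (qqline, which adds a reference line through the quartiles, is an optional extra):
qqnorm(naoMJ)
qqline(naoMJ)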
As the default title of the plot indicates, plots of this kind are also called
"Q-Q plots" (quantile versus quantile). The sample of values has heavy tails
if the outer parts of the curve are steeper than the middle part.
A "box plot", or more descriptively a "box-and-whiskers plot", is a graphical
summary of a set of values. It is used to examine the distribution of
your data. To get a box plot of the NAO values, type:
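boxplot(naoMJ)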
The box in the middle indicates that the middle 50% of the data (between the
first and third quartiles) lies between about -1 and +0.5 s.d. The
median value is the thick horizontal line inside the box. The lines
("whiskers") show the largest/smallest value that falls within a distance of
1.5 times the box size from the nearest quartile. If any value is
farther away, it is considered "extreme" and is shown separately.
Boxplots can also be used to look at conditional distributions.
For instance, what is the distribution of the NAO values conditional on the
upcoming season having few or many U.S. hurricanes? This is a question
that is important as we move toward modeling our data. In section 2 we
noted a negative correlation between the number of U.S. hurricanes and the NAO
thus we would expect to see lower values of NAO when the rate of U.S.
hurricanes is above normal.
The parallel box plots help us visualize the relationship between hurricanes
and the NAO. They are made by creating a grouping factor and handing it to the boxplot function, as sketched below.
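A minimal sketch; the grouping threshold (2 or more U.S. hurricanes) and the group labels are assumptions:
label = factor(US >= 2, labels=c("0 or 1 US hurricanes", "2 or more US hurricanes"))
boxplot(naoMJ ~ label, ylab="May-June NAO (s.d.)")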
Box plots can also be used to examine seasonality. The AMO covariate
contains monthly values from 1856-2006. Suppose we want to look at a box
plot of the AMO values by month. Note: Don't worry about memorizing this sequence of commands.
amo2 = na.omit(amo) # omit rows with no data
amo2 = stack(amo2[,2:13]) # stack the monthly columns into a single column
# create an ordered factor (months)
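A sketch of this step and the plot, assuming columns 2 through 13 of the amo data frame are the months in calendar order:
ind = factor(amo2$ind, levels=names(amo)[2:13], ordered=TRUE)
boxplot(amo2$values ~ ind, xlab="Month", ylab="AMO (deg C)")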
Note that the examples
as described will plot the figures on the screen. In order to create hard
copy output, we need to specify an output device.
R supports several devices, but by far the most commonly used are the
postscript and pdf devices. We first open the device explicitly. For
example, postscript(file="filename") or pdf(file="filename").
All plotting commands
that follow will be written to that device, until you close it with a
dev.off() command. For example,
pdf(file="filename.pdf") # the file is written outside of the R session into your working directory
plot(ind, amo2$values) # plotting command(s)
dev.off() # close the device
The width and height
of the postscript plot are specified with the
width= and height= arguments in the
postscript function. Units on the
values are inches.
5. Probability and Distributions
The concepts of randomness
and probability are central to statistics and data modeling. Here we
look at the basic ideas of probability with the help of R functions.
Much of the earliest work in probability theory was about games and
gambling. The basic notion is that of random sampling from well-shuffled
card decks or picking colored balls from an urn.
In R you can simulate these situations with the
sample function. If you want to pick
five numbers at random from the set of numbers 1 through 40, then you type:
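sample(1:40, 5)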
The first argument is a vector of values to be sampled and the second is the
sample size. Your sample likely will be different than mine. To
sample 5 years of U.S. hurricanes, you would type:
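sample(ushur$US, 5)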
Note that the default behavior of
sample is sampling without
replacement. In the case of the set of numbers 1 through 40, each number
will be different. In the case of the set of hurricane counts, each
count will correspond to a different year. If you want sampling with
replacement, you need to add the argument replace=TRUE.
To simulate 10 coin tosses we could write:
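One possibility (labeling the outcomes heads and tails is a choice):
sample(c("H", "T"), 10, replace=TRUE)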
The standard distributions that turn up in connection with data modeling and
statistical tests are built into R. Here we look only at a few of these
distributions, but all distributions follow the same pattern.
Four fundamental items can be calculated for a statistical distribution:
Density or point probability
Cumulative probability, distribution function
Quantiles
Pseudo-random numbers
For all distributions implemented in R, there is a function for each of the four
items. For instance, for the normal distribution, these are named dnorm, pnorm,
qnorm, and rnorm, respectively.
The density for a continuous distribution is a measure of the probability of
"getting a value close to x". The probability of getting a value in a
particular interval is the area under the corresponding part of the
curve. For discrete distributions, the term "density" is used for
the point probability--the probability of getting exactly the value x.
Technically, this is correct: It is a density with respect to a counting measure.
The density item of the distribution functions can be used to "see" the
distribution. For instance, to see the well-known bell curve of the
normal distribution, type:
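For example (the range of x values is a choice):
x = seq(-4, 4, 0.1)
plot(x, dnorm(x), type="l")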
An alternative is to type:
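curve(dnorm(x), from=-4, to=4)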
This is often a convenient way of making graphs, but it requires that the
y-values can be expressed as a simple functional expression in x.
For discrete distributions, where variables can take on only distinct values, it
is preferable to draw a pin diagram. For instance, what is the probability
of getting exactly x successes in 50 trials with the probability of success for
each trial at 33%?
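x = 0:50
plot(x, dbinom(x, size=50, prob=0.33), type="h")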
Notice that there are three arguments to the
dbinom function. In addition to x, you
need to specify the number of trials (size)
and the probability (prob). The
distribution drawn corresponds to, for example the number of 5s or 6s in 50
throws of a fair die. Actually, dnorm
also takes more than one argument, namely the mean and standard deviation, which
by default are set to 0 and 1, respectively since most often it is the standard
normal distribution that is requested. It is
type="h" (as in histogram-like) that causes
the pins to be drawn.
To see how the graph varies with respect to the parameters, we can animate the
evolution by drawing a series of graphs (Baclawski, 2008). We need to
create and display a series of graphs in quick succession. We do this by
defining a function then applying it to a sequence of parameters.
There are several R utilities available for repeatedly applying a
function. The most common is sapply. The
sapply function applies every element of the
first argument to the function defined in the second argument and returns a
vector with the values computed by the function.
The function we want to apply in this case is the
plot function. However, because the
graphs are drawn quickly it is necessary to add a pause between them. This
is done with the Sys.sleep function.
It causes the program to stop running for a specified number of seconds.
In animation this is called the dwell rate.
Our first animation shows how the binomial distribution varies with the
probability of success. First we define a function, binom, that draws the pin
diagram for a given probability of success, as sketched below.
To run your function binom on a series of
probabilities from 0 to 1 in steps of .1, type:
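Using the binom function sketched above:
dummy = sapply(seq(0, 1, 0.1), binom)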
Notes: (1) To suppress the output to the screen, we
can assign it to a dummy variable. (2) To make the y-axis scale the same for
each plot, we can add ylim=c(0,0.5) to the plot command inside the function.
Next we look at what happens when the number of trials varies but the
probability of success is fixed at p=1/2. This requires modifying the function
so that the number of trials, rather than the probability, is its argument.
The result is quite remarkable. The graph quickly converges to a shape
that changes little except that it becomes narrower. This is the graphical
manifestation of the Central Limit Theorem (the most important theorem in
probability). To see the convergence more dramatically, we need to control
Successes",ylab="Probability of Success")
Repeat using a probability of success =
Cumulative distribution functions
The cumulative probability function describes the probability of "hitting" x or
less in a given distribution. The corresponding R functions begin with a
'p' (for probability) by convention.
Just as you can plot densities, you can also plot cumulative distribution
functions, but that is usually not very informative. More often, actual
numbers are desired. For example say you are interested in knowing the
probability of more than 3 U.S. hurricanes next year, given that the mean number
is 1.71. In this case you would type:
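ppois(3, lambda=1.71, lower.tail=F)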
Thus knowing the rate of U.S. hurricanes is 1.71 per year, the probability of
observing more than 3 U.S. hurricanes next year is 9.5% assuming all else the
Question: What is the probability that next year's NAO value averaged over May
and June will be less than or equal to -1 sd?
Similar to dnorm, we can also plot the cumulative distribution:
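curve(pnorm(x), from=-4, to=4)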
The quantile function is the inverse of the cumulative distribution
function. The p-quantile is the value with the property that there
is a probability p of getting a value less than or equal to it. The median
is by definition the 50% quantile. Thus the 25% quantile of the NAO data
should be about -1. To verify, type:
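quantile(naoMJ, 0.25)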
To find the 95% confidence interval about the population mean NAO value, we can
sigma= sd(naoMJ) # sample standard
n=length(naoMJ) # sample
sem=sigma/sqrt(n) # standard error of the
xbar+sem*qnorm(0.975) # lower and upper end points of the 95%
confidence interval about the mean
Note that qnorm(0.025) =
We can also plot the quantile function with suitable limits:
curve(qnorm(x), from=0, to=1)
To many people it sounds like a contradiction to generate random numbers on a
computer since computer results are predictable and reproducible. What is
in fact possible is to generate sequences of "pseudo-random" numbers, which for
practical purposes behave as if they were drawn randomly.
The use of the functions in R that generate random numbers is
straightforward. The first argument specifies the number of random values
to generate and the subsequent arguments refer to characteristics of the
distributions similar to those for the other distributional functions.
To generate 10 random numbers from a standard normal distribution, type:
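rnorm(10)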
rnorm(10, mean=-0.336, sd=0.99) # 10 random
normal values with the same mean and sd as the NAO May-June values
rbinom(10, size=20, prob=0.5) # 10 random
binomial values, where each number represents the number of successes in 20 trials
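A sketch of such a simulation (the plotting details are choices):
plot(ushur$Year, rpois(156, lambda=1.71), type="h", xlab="Year", ylab="Simulated count")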
This last one simulates and plots the U.S. hurricane record. Note that
these simulations often contain periods of high and low activity by chance.
Finally, let's generate a histogram of random Poisson counts and compare it with
our histogram of U.S. hurricanes from section 3.
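For example:
hist(rpois(156, lambda=1.71))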
You've probably heard the old phrase about pulling yourself up by your own
bootstraps. That is where the term 'bootstrap' in statistics comes
from. You might only have a single sample of hurricane observations over
some time period, but you can sample (bootstrap) from the original sample
in many ways so long as you allow some values to appear more than once and others
not to appear at all (sampling with replacement). You calculate the mean
of the values in each bootstrap sample and obtain a confidence interval by
looking at the extreme highs and lows using the quantile function.
a = numeric(1000)
for(i in 1:1000) a[i] = mean(sample(ushur$US, replace=TRUE))
quantile(a, c(0.025, 0.975)) # 95% confidence interval for the hurricane rate
hist(a, main="", xlab="Historical Annual U.S. Hurricane Rate")
Note the large range of bootstrapped hurricane rates. A 95% confidence
interval spans a factor of 3. This is characteristic of small count data
and should be kept in mind when describing periods as "active" and "inactive".
6. Data Models
Theoretical distributions, like the normal and the Poisson, are examples of data
models. If your data conform to a theoretical distribution then the
analysis is simplified, because the properties of the distributions are known.
Fitting models to data is a central function of R. The object is to
determine a minimal adequate model from the large set of potential models that
might be used to describe the given set of data.
Parsimony says that, other things being equal, we prefer:
A model with n-1 parameters to a model with n parameters
A model with k-1 explanatory variables to a model with k explanatory variables
A linear model to a model that is curved
A model without interactions to a model with interactions
We also prefer models containing explanatory variables that are easy to measure
over variables that are difficult to measure or are derived (e.g., EOFs).
And we prefer models that are based on a sound mechanistic understanding of the
process over purely empirical functions.
All of the above are subject to the caveats that the simplifications make good
scientific sense and do not lead to significant reductions in explanatory
power. There is no perfect model.
Models in R are specified with a formula: response variable ~
explanatory variable(s), where the ~ reads "is modeled as a function of".
There is a temptation to become personally attached to a particular model.
You do well to remember the following truths about models:
All models are wrong.
Some models are better than others.
Some models are useful.
The correct model can never be known with certainty.
The simpler the model, the better, all else being equal.
After fitting a model to your data it is essential that you investigate how well
the model describes the data. In particular, are there any systematic
trends in the goodness of fit? This is done by examining the model
residuals which are computed as the difference between the observed response and
the predicted response.
In section 2 we saw how to get the correlation between two variables. Correlation
is nothing more than a statistical term that indicates whether two variables
move together. It tends to be cold outside when it snows; these two
factors are positively correlated. Sunshine and rain, meanwhile, are negatively correlated.
Easy enough so long as there are only a couple of variables. With many
variables things get harder. Regression analysis is the tool that enables
a geographer to sort out piles of data with many variables. It does so by
artificially holding constant every variable except the two he wishes to focus
on, then showing how these two co-vary.
It has been said that economics is a science with excellent tools for gaining
answers but a serious shortage of interesting questions. Geography is in
the other boat. It has no shortage of interesting questions, but it tends
to lack tools for gaining answers. Of all the tools in an economist's
toolbox, the most useful is regression.
Mathematically, regression attributes to a response
variable y a normal distribution whose expected value depends on a covariate x,
typically in the following way: E[y] = a + b x. The parameters
a and b are estimated by a least-squares minimization procedure. It is
called "linear" regression because the variables multiplied by their parameters
are added together.
For linear regression, the function lm is
used (linear model). To obtain a linear regression model of the August
through October AMO on the May through June NAO, type:
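lm(amoAO ~ naoMJ)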
lm(formula = amoAO ~ naoMJ)
The argument to the lm function is a model
formula. The variable to the left of the tilde is the dependent variable
and the variable to the right is the independent variable.
Regression is a model of the response (dependent) variable (usually denoted by
"y") ON the explanatory (independent) variable, (usually denoted by "x").
You regress y onto x. Unlike correlation, which is not a model for the
data, it matters which variable is which and the choice is made by the
investigator, not the software.
Continuous explanatory variables are called covariates.
The statement "I regressed variable A and variable B" is vague.
The coefficients are the values of the intercept and slope of the straight line
through the scatter of points when the NAO is on the horizontal axis. The
equation for the line is read as:
The mean of Aug-Oct AMO in units of degrees C equals 22.7 minus 0.02687 times
the May-Jun NAO in units of s.d. Thus the units on the intercept are the
same as the units of the response variable, in this case degrees C, and the
units on the slope are degrees C/s.d.
To see the line, type:
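plot(naoMJ, amoAO)
abline(lm(amoAO ~ naoMJ))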
The regression model is different if the NAO is
regressed on the AMO. Try it. Also consider a regression of
hurricane counts on year.
In its raw form, the output of lm is very
brief. All you see is the estimated intercept and estimated slope values,
but there are no tests of significance. The result of
lm is a model object. This is a
distinctive concept in R. Where other statistical programs focus on
generating printed output that can be controlled by setting options, you get
instead the result of a model fit encapsulated in an object from which the
desired quantities can be obtained using other functions.
More output is obtained if you type:
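summary(lm(amoAO ~ naoMJ))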
Call:
lm(formula = amoAO ~ naoMJ)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52648 -0.14007 -0.02344  0.12599  0.65141

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 22.79375    0.02008 1135.221   <2e-16 ***
naoMJ       -0.02687    0.01920   -1.399    0.164
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2345 on 149 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-Squared: 0.01297, Adjusted R-squared: 0.006343
F-statistic: 1.957 on 1 and 149 DF, p-value: 0.1639
Let's dissect the output.
Call:
lm(formula = amoAO ~ naoMJ)
The output starts with a repeat of the
function call. This is not very interesting, but it is useful if the
result is saved in an object that is examined later.
Residuals:
     Min       1Q   Median       3Q      Max
-0.52648 -0.14007 -0.02344  0.12599  0.65141
This gives us a quick look at the distribution of the residuals. The
residuals are the key for understanding how well the model fits the data and
whether there are problems. The average of the residuals is zero by
definition, so the median should not be far from zero and the minimum and
maximum should be roughly equal in absolute magnitude. Similarly for the
1st (1Q) and 3rd (3Q) quartile values.
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 22.79375    0.02008 1135.221   <2e-16 ***
naoMJ       -0.02687    0.01920   -1.399    0.164
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here we again see the regression coefficients including the intercept and slope,
but this time with accompanying standard errors, t values and p values.
Note that the slope value and its statistics are given to the right of the
covariate, but they refer to the coefficient that is multiplied by the covariate
values, not the covariate values themselves. To the right of the p values
is an indicator of the level of significance. Here we note that the NAO is
not a very good predictor of the AMO.
Residual standard error: 0.2345 on 149 degrees of freedom
This is the residual variation, an expression of the variation of the
observations around the regression line, estimating the model parameter
sigma. The degrees of freedom is the number of observations minus 2 (since
there are two model parameters).
(5 observations deleted due to missingness)
This indicates that there are 5 missing values in either the response or
covariate or both. Rows with missing values are not included in the model
(deleted). Care must be exercised when comparing models built with
differing data set sizes.
Multiple R-Squared: 0.01297, Adjusted R-squared: 0.006343
The first item is the R squared value. When multiplied by 100%, it is
interpreted as the variance explained by the covariate.
R squared = 1 - (residual sum of squares)/(total sum of squares)
The adjusted R squared is an index that accounts for the loss in the degrees of
freedom when adding a covariate to the model. It is important in the
context of multiple regression.
F-statistic: 1.957 on 1 and 149 DF, p-value: 0.1639
This is an F test for the hypothesis that the slope is zero. The
F-statistic has a value of 1.957 on 1 and 149 degrees of freedom (DF). For the
case of a one covariate model, it gives the same p-value as the one on the slope
coefficient. It is more interesting in the case of more than one covariate.
Another example: airquality data set.
Regress ozone concentration (Ozone) on solar
radiation (Solar.R), wind speed
(Wind), and temperature (Temp).
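A sketch of the fit, using the airquality data set that comes with R:
summary(lm(Ozone ~ Solar.R + Wind + Temp, data=airquality))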
The R-squared value can be manipulated simply by adding more explanatory
variables. The adjusted R-squared value is an attempt to correct this
short-coming by adjusting both the numerator and the denominator by their
respective degrees of freedom.
adjusted R squared = 1- (1 - R2 )[(n - 1)/(n - p - 1)]
R2 = R squared
n = number of observations
p = number of explanatory variables
Unlike the R-squared, the adjusted R-squared will decline in value if the
contribution to the explained deviation by the additional variable is less than
the impact due to the loss in degrees of freedom. Thus it is a better way to
compare models than using R-squared.
Note: while the R-squared is a percent, the adjusted R-squared is not and should
be referred to as an index value.
To obtain confidence intervals on the model coefficients, use the confint function.
To get predictions from the model for
each set of explanatory variables, use the predict function.
Point-wise confidence intervals and prediction intervals on the predictions
are obtained with the interval argument of predict, as sketched below.
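A sketch (the object name model is a choice):
model = lm(amoAO ~ naoMJ) # save the fit in an object
confint(model) # confidence intervals on the coefficients
predict(model) # predictions for each set of explanatory variables
predict(model, interval="confidence") # point-wise confidence intervals
predict(model, interval="prediction") # prediction intervals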
By fitting a linear regression model, you are making several implicit
assumptions including linearity, normally distributed residuals, constant
variance, and independence.
Linearity assumption: The means of the subpopulations fall on a straight-line function
of the explanatory variables.
Normality assumption: There is a normally distributed subpopulation of responses for
each value of the explanatory variable.
Constant variance assumption: The subpopulation standard deviations are all equal.
Independence assumption: The selection of an observation from any of the subpopulations is
independent of the selection of any other observation.
The first three assumptions are best checked using plots.
The fourth assumption can be checked by plotting the autocorrelation function of
the model residuals.
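A sketch, reusing the model object from the previous sketch:
plot(model) # standard residual diagnostic plots
acf(residuals(model)) # autocorrelation function of the residuals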
It is also a good idea to check for collinearity among the explanatory
variables. This can be done with the cor
function as we saw earlier. If the correlation between two
explanatory variables is between -0.6 and +0.6 then collinearity is usually not
a big problem.
Modeling the relationship between explanatory and response variables is a
fundamental activity in applied statistics. Often the response is not a
numerical value. Instead, the response is one of two possible outcomes (a
binary response), e.g., landfall or no landfall.
Here we look at binary response data and its analysis through logistic
regression. Concepts from simple and multiple regression carry over to
logistic regression. The idea of maximum likelihood (ML) estimation is
central to the modeling of binary data.
All applications of logistic regression have two things in common: (1) a binary
response, and (2) the idea of predicting a chance, probability,
proportion or percentage. The response variable is bounded from below by 0
and from above by 1.
The mean of a binary response is a probability. For example, generate a
set of random binary responses.
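A sketch (the success probability of 0.5 is an arbitrary choice):
y = rbinom(30, size=1, prob=0.5) # 30 random 0/1 responses
mean(y) # the mean is an estimate of the probability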
Repeat several times, then change 30
to 300 and repeat several times.
Thus a logistic regression model specifies that a probability---such as the
probability of a hurricane---is related to the explanatory variables, such as
SST and ENSO.
Interpretation of logistic regression coefficients are made in terms of
statements about odds and odds ratios. Although the model, estimation, and
inference procedure differ from those of ordinary regression, the practical use
of logistic regression analysis parallels the usual regression analysis closely.
Problems with linear regression when the response variable is binary
The problem with linear regression with a binary response variable arises from
the fact that probabilities have maximum and minimum values of 1 and 0. By
definition, probabilities and proportions cannot exceed 1 or fall below 0.
Yet, the linear regression line can extend beyond these limits. Depending on the
slope of the regression line and the observed values of the explanatory
variable, a model can give predicted values of the response above 1 and below
0. Values beyond the range of 0 and 1 make no sense, and have little practical use.
Example: Consider the data collected from the recovered solid boosters of the 23
previous shuttle flights (before the Challenger disaster). The damage
variable is coded as 0 for no damage, and 1 for damage. The temperature is
the air temperature in F at the launch site.
A scatter plot of damage and temperature with the best fit line (damage is the
response variable) is obtained by typing:
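A sketch, assuming the damage and temp variables have been read in and attached:
plot(temp, damage)
abline(lm(damage ~ temp))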
The plot shows there is less likely to be damage when temperatures are warmer,
but a line does not do justice to the relationship. Regress damage on
temp. Is the model significant?
Yes, temperature is significant in explaining O-ring damage.
Write an equation for the model.
Predicted damage = 2.905 - 0.037 x temp. The
predicted damage can be interpreted as the predicted probability of damage at a given temperature.
What is the predicted probability of damage if the temperature is 77F, 65F, 51F?
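A sketch of the calculation, assuming the damage and temp variables as above:
predict(lm(damage ~ temp), newdata=data.frame(temp=c(77, 65, 51)))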
1 2 3
0.026 0.480 1.000
Although this appears to be a reasonable prediction equation, it can lead to
nonsensical predictions. For example, what is the predicted probability of
damage when the temperature is 85F or 45F?
These are nonsense probabilities. The bounded nature
of probabilities is not incorporated into the linear regression model.
Another clue to the inadequacy of the linear regression model for these data is
shown in the pattern of the residuals.
Rather than exhibiting a random scatter of points, the plot of residuals versus
temperature shows two slanted lines.
Because of these problems we need to consider another type of model for binary data.
A generalized linear model
A regression model has two parts. 1) the response function, 2) regression
structure. With linear regression the response function is the mean of y
given x and the regression structure is a summation of the explanatory variables
multiplied by their coefficients. In equation form the two parts are
separated by an equal sign
mean(y|x) = a + b * x
In R the two parts of a regression model are separated by the tilde (~).
A generalization of the linear regression model is made by considering the
response function to be the logit of the mean. The logit is given by
logit(mean) = log(mean/(1-mean))
Note that the logit is the logarithm of the odds expressed as a fraction.
Odds are for:against. If there is a 2:1 odds of something happening, then
the probability it will happen is 2/(1+2) or 2/3. (2/3 / (1-2/3)) = 2,
which is the fraction for/against.
The inverse of the logit function is called the logistic function, so that if
logit(pi) = a + b*x, then pi = exp(a + b*x)/(1 + exp(a + b*x)).
Note that the logit is a nonlinear function, but with logistic regression there
is no change to the regression structure, so we call the model a generalized
linear model. It is linear in the regression structure, but nonlinear in
the response function.
Since the mean is a probability for a set of binary outcomes, it is customary to
use the symbol pi, so the logit function is written as
logit(pi) = log(pi/(1-pi))
Example: Shuttle O-ring damage
To model the probability of O-ring damage as
a logistic regression using air temperature at the launch site as the
explanatory variable, type
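A sketch (the object name model2 is a choice):
model2 = glm(damage ~ temp, family=binomial)
summary(model2)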
Significance and adequacy of the logistic regression model are addressed using the model deviances.
# Is the model significant? (based on the drop in deviance from the null model)
pchisq(20.315, 21, lower.tail=F) # Is the model adequate?
To predict the probability of O-ring damage when the launch temperature is 55F, type:
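Reusing the model2 object from the sketch above:
predict(model2, newdata=data.frame(temp=55), type="response")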
To plot the model on the scatter of points, type:
Create a binary variable from the US hurricane counts based on whether or not a year had 2 or
more hurricanes (e.g., US>1). Regress this binary variable on NAOmj
(North Atlantic Oscillation during May and June in units of s.d.), SOIao
(Southern Oscillation Index during August through October in units of s.d.),
SSTao (Atlantic sea-surface temperature during August through October in units
of degrees C), and SSNao (sunspot number averaged over August through
October). Which variables are significant? Rerun the model with the
insignificant variable(s) removed. Is the reduced model significant?
Is the reduced model adequate? Use the model to predict the probability of
a multiple hit year given an NAO value of -2 s.d., a SOI value of 2 s.d., and a
SST value of 23 C.
The Poisson distribution is one of the most useful and important of the discrete
probability models for describing count data. We know how many times
something happened. The number of lightning strikes over Tallahassee or
the number of hurricanes affecting the state of Florida, are examples.
The Poisson distribution is a one-parameter distribution with the interesting
property that its variance is equal to its mean. A large number of
processes show variance increasing with the mean, often faster than linearly.
The density function of the Poisson distribution shows the probability of
obtaining a count x when the mean count per unit is lambda. It is given by
the exponential of minus lambda times lambda raised to the x power all divided
by x factorial.
The probability of zero events is obtained by setting x to 0, which yields
exp(-lambda). Thus the probability of zero hurricanes affecting the U.S. next
year, given a mean rate of 1.7 hurricanes/year is 18%.
Of course, this implies that the probability of at least 1 hurricane is 100 - 18 = 82%.
The dpois function evaluates the above
equation for any count of interest. To determine the probability of
observing exactly 1 hurricane when the rate is 1.7 hurricanes per year, type:
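dpois(1, lambda=1.7)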
Or the probability of 5 hurricanes when the
rate is 1.7 expressed in percent.
To answer the question, what is the
probability of 1 or fewer hurricanes we use the cumulative probability function
ppois. Thus to answer the question,
what is the probability of more than 2 hurricanes, we type:
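ppois(1, lambda=1.7) # probability of 1 or fewer hurricanes
ppois(2, lambda=1.7, lower.tail=F) # probability of more than 2 hurricanes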
If we want to simulate 100 seasons of hurricane strikes on the U.S. coast with a
rate that is commensurate with the long-term rate of 1.7 hurricanes per year, we type:
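One possibility (the object name h is a choice):
h = rpois(100, lambda=1.7)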
To see the distribution, type:
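Using the h object from the sketch above:
table(h)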
Repeat for another 100 years.
To plot the distribution, type:
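barplot(table(h))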
Poisson regression model
What if the rate of hurricanes depends on time or climate variables? Note
the way the question is worded. We are interested in the rate of
hurricanes because given the rate, the counts follow a Poisson
distribution. Thus we are not modeling counts directly.
There's a mistaken belief that Poisson regression is least squares regression on
the logarithm of counts. It's not. It is a regression using the
logarithm of the rate. It's nonlinear in the regression function, but
linear in regression structure and the model coefficients are determined by the
method of maximum likelihood rather than the method of least squares.
Our interest here is to study the relationship between the climate variables and
coastal hurricane numbers. We do this with a regression model that
specifies that the logarithm of the annual rate is linearly related to the covariates.
The Poisson regression model attributes to a response variable y a
Poisson distribution whose expected value depends on a covariate x, typically in
the following way: log(E[y]) = a + b x. The parameters a and b are
estimated via the maximum likelihood procedure.
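The model is fit with the glm function; the call below matches the Call line in the output that follows, with the fitted object stored under a name of our choosing:
prfit = glm(ushur$US ~ naoMJ, family = poisson)
summary(prfit)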
Call:
glm(formula = ushur$US ~ naoMJ, family = poisson)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1480  -0.8054  -0.1172   0.5346   2.8388

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.44195    0.07019   6.297
naoMJ       -0.217      0.06336  -3.426 0.000612 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 207.16 on 155 degrees of freedom
Residual deviance: 195.26 on 154 degrees of freedom

Number of Fisher Scoring iterations: 5
We note that the coefficient on the
NAO covariate is highly significant (p value < 0.001). The negative
sign on the coefficient indicates that there are fewer U.S. hurricanes when
the NAO is high.
The value of -0.217 indicates that for a one s.d. change in the NAO, the
difference in the logarithm of expected counts is -0.217. If there are
other covariates we must add "given the other covariates in the model are
held constant". Since log(A)-log(B) = log(A/B), we can interpret the
parameter estimate as the log of the ratio of expected counts.
The residual deviance replaces the residual standard error and is used to
determine model fit.
In Poisson regression, like logistic regression, the drop in deviance (deviance of the null
model minus the deviance of the full model) is used to test whether the model is
significant against the null hypothesis that none of the covariates are
useful. Since the model contains significant covariates the model itself
is significant. To see this, type:
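Using the null and residual deviances from the summary output above:
pchisq(207.16 - 195.26, 155 - 154, lower.tail = FALSE)   # small p-value: the model is significant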
Model adequacy is
examined with the residual deviance.
The residual deviance is 195.26 on 154 degrees of freedom. A p-value from a
chi-squared distribution with this quantile value and this dof tells us about
adequacy: a large p-value is evidence in favor of the null hypothesis that the
model is adequate, while a small p-value is evidence against model adequacy.
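Using the residual deviance and its degrees of freedom from the output above:
pchisq(195.26, 154, lower.tail = FALSE)   # p-value near 0.01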
Here we find moderate evidence against model adequacy indicating that other
covariates are missing from the model. NAO is not the entire story.
While a Poisson distribution for given x has variance of y equal to mean of y,
this assumption may not be supported in the data. If variance tends to
exceed the mean, the data is over-dispersed relative to the Poisson
distribution; under-dispersion occurs when variance is smaller than the
mean. If a Poisson model seems appropriate but doesn’t fit well,
over-dispersion or under-dispersion could be the reason.
The AIC is the Akaike information criterion, which is equal to 2 times the number
of model parameters minus 2 times the logarithm of the likelihood
function. The AIC is an operational way of trading off the
complexity of an estimated model against how well the model fits the data.
The smaller the AIC, the better the model. Adding a covariate that
captures the probability of hurricane genesis will likely improve the model.
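For a fitted glm the AIC can be extracted directly; using the fitted object name assumed earlier:
AIC(prfit)
2 * 2 - 2 * as.numeric(logLik(prfit))   # the same value from the definition (two model parameters)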
Fit a Poisson regression model to the U.S. hurricane
counts using all three covariates. Interpret the results.
Quantile regression extends the ordinary least squares regression model to
conditional quantiles (e.g., 90th percentile) of the response variable.
Recall that ordinary regression is a model for the conditional mean (the mean of
the response is conditional on the value of the explanatory variables). It is a
semi-parametric technique because it relies on the non-parametric quantiles, but
uses parameters to assess the relationship between the quantile and the covariates.
Quantiles are points taken at regular intervals from the cumulative distribution
function of a random variable. The quantiles mark a set of ordered data into
equal-sized data subsets.
Download the set of per storm maximum wind speeds (Save Link or Target As) into
your working directory. The data are available under the link per storm maximum
wind speeds.
Read the data into your current R session and check the values from the first
six rows by typing:
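A sketch, with the file name and the data frame name (StormMax) as our assumptions:
StormMax = read.table("MaxWinds.txt", header = TRUE)   # substitute the name of the downloaded file
head(StormMax)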
    Yr Region     Wmax   nao sstmda        sun       split
1 1899  Basin 96.64138 -0.64  -0.21 0.05193367 -0.03133333
2 1899   East 90.19791 -0.64  -0.21 0.05193367 -0.03133333
3 1899  Basin 90.35300 -0.64  -0.21 0.05193367 -0.03133333
4 1899   East 37.51933 -0.64  -0.21 0.05193367 -0.03133333
5 1899  Basin 51.06743 -0.64  -0.21 0.05193367 -0.03133333
6 1899  Basin 40.00000 -0.64  -0.21 0.05193367 -0.03133333
Here Wmax is the maximum estimated wind speed for each cyclone in the North
Atlantic by region. Basin refers to the region away from the U.S. coast.
Quantiles and distributions
Attach the data frame, subset by Basin, and determine the quartiles (0.25 and
0.75 quantiles) of the wind speed distribution.
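Using the StormMax name assumed above:
attach(StormMax)
quantile(Wmax[Region == "Basin"], c(0.25, 0.75))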
We see that 25% of the storms have a maximum wind speed less than 51 kt and 75%
have a maximum wind speed less than 96 kt so that 50% of all storms have a
maximum wind speed between 51 and 96 kt (interquartile range).
To see if there is a trend in maximum wind speed values over time, we can use
the linear regression technology from a previous section. However, we note
that the distribution of per cyclone maximum wind speed is not normal so it is
best to transform the wind speeds before using linear regression. A
logarithmic transform is often a good choice.
To examine the distributions of the raw and log transformed wind speeds, type:
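For example, for the Basin storms:
basin = subset(StormMax, Region == "Basin")
par(mfrow = c(1, 2))
hist(basin$Wmax, main = "", xlab = "Wmax (kt)")
hist(log(basin$Wmax), main = "", xlab = "log(Wmax)")
par(mfrow = c(1, 1))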
To see if there is a significant trend in the mean wind speed over time, type:
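A sketch of the trend model; the restriction to the Basin storms here is our assumption:
summary(lm(log(Wmax) ~ Yr, data = StormMax, subset = Region == "Basin"))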
            Estimate Std. Error t value Pr(>|t|)
(Intercept)            0.7600147   4.676 3.33e-06
Yr          0.0003521  0.0003880   0.907    0.364
The answer is, unequivocally, no.
However, this is a model for the conditional mean (here, conditional on
year). What about higher wind speeds? The theory of maximum
potential intensity refers to storms at their maximum possible wind speed.
The subset of upper quartile maximum wind speeds would be on average a subset of
storms closer to their maximum possible wind speed.
So what we need is a model for the conditional quantiles.
Here we restrict the dataset to the years of the satellite era.
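A sketch, taking the Basin storms and using 1981 as an assumed start of the satellite era, followed by a scatter plot of intensity against year:
detach(StormMax)
StormMaxB = subset(StormMax, Region == "Basin" & Yr >= 1981)   # 1981 is an assumed cutoff
attach(StormMaxB)
plot(Yr, Wmax, xlab = "Year", ylab = "Wmax (kt)")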
The graph verifies the lack of trend in the central tendency of maximum cyclone
intensities. It also shows a tendency for the strongest storms to get
stronger over time. To quantify this trend and determine significance we
turn to a quantile regression model.
Quantile regression models
The functionality is available in the
quantreg package thanks to Roger Koenker.
We begin by running a median regression on year and
checking if the parameter (slope) is significantly different from zero.
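The package must be installed and loaded first; the choice of standard-error method (se = "nid") in the summary call is our assumption:
install.packages("quantreg")   # if not already installed
library(quantreg)
summary(rq(Wmax ~ Yr, tau = 0.5), se = "nid")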
Call: rq(formula = Wmax ~ Yr, tau = 0.5)

tau: 0.5

Coefficients:
            Value     Std. Error t value  Pr(>|t|)
(Intercept) 118.58813 437.70671    0.27093  0.78661
Yr           -0.02639   0.21955   -0.12022  0.90438
As expected we find no significant trend in the median storm intensity over
time. How about trends in the upper quantiles?
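For example:
summary(rq(Wmax ~ Yr, tau = c(0.75, 0.90, 0.95)), se = "nid")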
At the 75th and 90th percentiles, there is
suggestive, but inconclusive evidence of a trend, but at the 95th percentile
there is convincing evidence of an increase in the intensity of the strongest
hurricanes in accord with MPI theory.
To show the results on a plot, type:
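A sketch that produces a plot of the kind described below:
plot(summary(rq(Wmax ~ Yr, tau = seq(0.7, 0.95, by = 0.01)), se = "nid"))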
The dots with lines indicate the trend for quantiles in the range
between 0.7 and 0.95 by increments of 0.01. The gray shading indicates the
95% confidence interval on the slope estimate. The red solid line is the
least squares regression slope and the dotted lines are the 95% confidence
interval about that estimate.
Quantile regression provides a more complete picture of how the distribution of
the response variable is conditioned on the covariates.
Classification and regression trees
An inherent limitation of multiple regression models is the difficulty in
providing a simple holistic interpretation of the results. This limitation
is most apparent when attempts are made to use the model for making
decisions. An alternative to multiple regression is a family of models
known as classification and regression trees (C&RT). These models are
easier to interpret in the context of decision making.
It is a classification tree if the response variable is categorical (warm vs
cold) and a regression tree if the response variable is continuous.
C&RT models have been used in many fields. An early application of a
C&RT model involved the identification of ship structures from radar
profiles. Hospitals use them to identify indicators of heart failure.
C&RT models are widely used in the finance industry.
C&RT models are nonparametric. Explanatory variables and their
interactions are selected based on their ability to parse the dependent
variable. When the more stringent theoretical and distributional assumptions of
more traditional methods (linear regression) are met, the traditional methods
are preferable. But as an exploratory model, or as a model of last resort when
traditional methods fail C&RT models are, in the opinion of many
researchers, quite useful.
When there are many explanatory variables C&RT models can be quite complex,
but graphs help simplify interpretation.
Before thinking about how C&RT models work, it helps to see one in
action. Download and install the tree
package from CRAN and read the data. The package is available thanks to Brian Ripley.
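For example:
install.packages("tree")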
Provided that your machine has a proper internet
connection and you have write permission in the appropriate system directories,
the installation of the package should proceed automatically. Once the
package is installed, it needs to be made accessible to the current R
session. To do this type:
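That is:
library(tree)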
These procedures provide access to a large
collection of specialized packages for statistical analysis.
The data we will use are in the file pollute.txt.
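A sketch, assuming the file sits in your working directory and storing it in a data frame we call pollute:
pollute = read.table("pollute.txt", header = TRUE)
names(pollute)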
"Industry" "Population" "Wind"
 "Rain" "Wet.days"
Create a C&RT model object using the tree function and plot it.
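A sketch, with pollute.tree as our name for the model object:
pollute.tree = tree(Pollution ~ ., data = pollute)
plot(pollute.tree)
text(pollute.tree)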
Instead of interpreting the parameter values from
the table of coefficients, the model is interpreted from the upside-down tree diagram.
You follow a path from
the top of the diagram (root of an upside-down tree) and proceed to one of the
terminal nodes (called a leaf) by following a series of rules (called
splits). The number at the tip of the leaf is the mean value of the
response variable in that subset of values (in this case SO_2
concentration). The inequality listed on the branch is the rule.
Rules are applied to the set of explanatory variables.
In the pollution example the first split of the response variable Pollution is
based on the explanatory variable Industry. The split is a rule (logical
operation) on the explanatory variable using a threshold value. Is the
value of Industry less than 748 units? If yes, branch to the left; if no
branch to the right. The left branch proceeds to another split while the
right branch proceeds to a leaf.
The model is fitted using binary recursive partitioning. The response
variable is successively split along coordinate axes of the explanatory
variables so that at any node, the split which maximally distinguishes the
response variable is selected. Splitting continues until nodes cannot be
split or there are too few cases (less than 6 by default).
Each explanatory variable is assessed, and the variable explaining the greatest
amount of deviance in the response variable is selected. Deviance is
calculated on the basis of a threshold value which produces two mean values for
the response (one mean above the threshold, the other below the threshold).
Note that in this example the variable Rain does not enter the tree.
Since the first split is on the variable
Industry at a value of 748, let's look at how
Pollution is partitioned on this variable.
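One way to see the two means, using the pollute data frame assumed above:
with(pollute, tapply(Pollution, Industry < 748, mean))   # FALSE: Industry >= 748, TRUE: Industry < 748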
This tells us that when
Industry is greater than or equal to 748, the mean
value for Pollution is 67 and when it is less
than 748, the mean value of Pollution is
24.92. To see this graphically we can use the following lines of code.
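A minimal sketch using the two means reported above:
with(pollute, plot(Industry, Pollution))
abline(v = 748, lty = 2)                        # the split threshold
segments(0, 24.92, 748, 24.92)                  # mean for Industry < 748
segments(748, 67, max(pollute$Industry), 67)    # mean for Industry >= 748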
The plot shows that the split of Pollution on
Industry<748 results in two means for
Pollution that are quite a bit different (in
fact they are maximally different over all such splits and over all explanatory
variables). The split results in less than 6 observations of
Industry>=748, so a terminal node is
produced and the mean value of Pollution for
these (5) observations is 67.
How the procedure works
The procedure works like this: for a given explanatory variable,
1. Select a threshold value (the vertical dotted line).
2. Compute the mean of the response above and below this threshold (the two horizontal lines).
3. Use the means to compute the deviance.
4. Go through all possible values of the threshold (values on the x axis).
5. Look to see which value of the threshold gives the lowest deviance.
6. Split the entire data set into high and low subsets on the basis of this threshold for this variable.
7. Repeat the entire procedure on each subset of the data on either side of the threshold.
8. Keep going until no additional reduction in deviance is obtained, or there are too few data points to merit further subdivision.
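To make the recipe concrete, here is a small sketch (not the tree package's own code) that searches all thresholds of one explanatory variable for the split giving the lowest deviance, where deviance is taken as the sum of squared deviations about the two subset means; the function name best.split is ours:
best.split = function(x, y) {
  thresholds = sort(unique(x))[-1]   # candidate thresholds with data on both sides
  dev = sapply(thresholds, function(t) {
    lo = y[x < t]; hi = y[x >= t]
    sum((lo - mean(lo))^2) + sum((hi - mean(hi))^2)
  })
  c(threshold = thresholds[which.min(dev)], deviance = min(dev))
}
best.split(pollute$Industry, pollute$Pollution)   # should point to a threshold near the 748 split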
The value of a split is defined as the reduction in the deviance and is
represented on the tree plot as the length of the hanging branch. Thus for
instance, the reduction in deviance is larger for the split on
Population than for the split on Wet.days.
The key questions are: Which variables to use for the split, and how best
to achieve the split for the selected variable.
Tree regression is analogous to forward selection in multiple regression.
Model output is printed by typing:
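Using the model object name assumed earlier:
pollute.tree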
node), split, n, deviance, yval
      * denotes terminal node

 1) root 41 22040 30.05
   2) Industry < 748 36 11260 24.92
     4) Population < 190 7 4096 43.43 *
     5) Population > 190 29 4187 20.45
      10) Wet.days < 108 11 96 12.00 *
      11) Wet.days > 108 18 2826 25.61
        22) Temp < 59.35 13 1895 29.69
          44) Wind < 9.65 8 1213 33.88 *
          45) Wind > 9.65 5 318 23.00 *
        23) Temp > 59.35 5 152 15.00 *
   3) Industry > 748 5 3002 67.00 *
The terminal nodes are denoted by an asterisk (there are 6 of them). The
node number is on the left and labeled by the variable on which the split at
that node was made. Next comes the split criterion which shows the
threshold value and an inequality sign. The number of cases going into the
split (or terminal node) comes next. The deviance value and the mean over
all cases are shown as the next two figures.
At the root node the mean is 30.05
(mean(Pollution)) with a deviance of 22040
from 41 cases. With the split on
Industry at 748 we see the deviance decreases
to 11260 (node 2) plus 3002 (node 3). Since node 3 has less than 6 cases,
it becomes a terminal node. The remaining deviance from node 2 is split
using Population at a value of 190. The
highest mean value of Pollution according to this partition is 67 (node 3) and
the lowest is 12 (node 10).
Note how the nodes are nested: within node 2, for example, node 4 is terminal
but node 5 is not; within node 5 node 10 is terminal but node 11 is not; within
node 11, node 23 is terminal but node 22 is not, and so on.
Tree regressions lend themselves to circumspect and critical analysis of complex
data sets. In the present example, the aim is to understand the causes of variation
in air pollution levels from case to case. The interpretation of the tree
regression is as follows:
The five most extreme cases of Industry
stand out and need to be considered separately.
For the rest, Population is the most
important variable, but interestingly, it is low populations that are
associated with the highest levels of pollution. Ask yourself which
might be cause, and which might be effect.
For high levels of population (> 190), the number of wet days is a key
determinant of pollution; the places with the fewest wet days (less than 108
per year) have the lowest pollution levels.
For those places with more than 108 wet days, it is temperature that is most
important in explaining variation in pollution levels; the warmest places
have the lowest air pollution levels.
For the cooler places with more wet days, it is wind speed that matters: the
windier places are less polluted. This kind of complex and contingent
explanation is easier to make with a tree model than with a multiple regression.
Like multiple regression, the tree regression can be used for making
predictions. The predicted values are easier to anticipate in this
case. For example, it is clear that if the
Industry variable exceeds 748, then the model
will predict a pollution value of 67. If Industry is less than 748 and
Population is less than 190, then the predicted Pollution value is 43.43.
To see this type:
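A sketch using the pollute.tree object, zeroing out the predictors that are not needed:
predict(pollute.tree, newdata = data.frame(Industry = c(750, 700), Population = 0,
  Temp = 0, Wind = 0, Rain = 0, Wet.days = 0))
# first case: Industry >= 748, predicted value 67
# second case: Industry < 748 and Population < 190, predicted value 43.43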
Note that we can zero out the values for the other predictors since they do not
enter the prediction for these values of Industry and Population.
Prediction using regression trees is simply following the set of rules as
organized by the tree. We start with the value of the variable Industry
and work our way down the tree until we get to a terminal node. The
value of the response variable representing the subset mean is the predicted value.
The standard error on the predicted value is more difficult to work out.
We can think of using the standard error on the subset mean, but that ignores
the uncertainty surrounding the other decisions that came before.
Model simplification in regression trees is similar to simplification in
multiple regression. There is a trade-off between bias and variance.
If the model is overly complex with many variables then the bias will be low but
variance will be high on the predicted values. If there are too few
variables then variance will be low, but the predicted values will be biased
against the excluded variables.
To deal with this trade-off, you should consider pruning your tree. The
function prune.tree determines a nested
sequence of smaller trees from the largest tree by snipping off the least
important splits. The function returns the tree size (number of terminal
nodes, $size), the total deviance of each tree ($dev), and the drop in deviance
by going from a smaller tree to a larger tree ($k).
Thus it is useful to plot the deviance as a function of tree size.
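For example:
plot(prune.tree(pollute.tree))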
We can see that the drop in deviance diminishes with each node added.
Deviance declines with increasing complexity. In fact, we could argue that
a tree with 4 terminal nodes is sufficiently complex as the drop in deviance
from 4 to 5 nodes is relatively small.
Suppose we want the best tree with four nodes.
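For example:
pollute.tree4 = prune.tree(pollute.tree, best = 4)
plot(pollute.tree4)
text(pollute.tree4)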
If the response variable is categorical rather than continuous, the tree
model is one of classification. A classification tree is based on a set of
if-then rules. "If petal length > x, then..., but if petal length <
x, then something else".
Consider the intrinsic data frame iris.
Use Species as the response variable and the
others as explanatory variables.
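A sketch, with iris.tree as our name for the model object; its summary output follows:
iris.tree = tree(Species ~ ., data = iris)
summary(iris.tree)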
Classification tree:
tree(formula = Species ~ ., data = iris)
Variables actually used in tree construction:
[1] "Petal.Length" "Petal.Width"  "Sepal.Length"
Number of terminal nodes:  6
Residual mean deviance:  0.1253 = 18.05 / 144
Misclassification error rate: 0.02667 = 4 / 150
We used a classification tree to help group hurricanes by genesis type in a
1996 paper on the classification of Atlantic basin hurricanes, Journal of
Climate, v9, 2880-2889.
7. Additional Reading and References
Baclawski, K., 2008: Introduction to Probability with R, Chapman & Hall/CRC.
Crawley, M.J., 2007: The R Book, John Wiley & Sons.
Dalgaard, P., 2002: Introductory Statistics with R, Springer.
Maindonald, J.H., 2004: Using R for Data Analysis and Graphics: Introduction, Code and Commentary.
Murrell, P., 2006: R Graphics, Chapman & Hall/CRC.
Tufte, E.R., 2001: The Visual Display of Quantitative Information, 2nd Ed., Graphics Press.
Verzani, J., 2004: Using R for Introductory Statistics, Taylor & Francis.
UCLA Academic Technology Services
8. Other Material
Define the random variable X_n = 1 if the nth toss of a coin is heads and X_n = -1 if it
is tails. Now form a new random variable S_n = X_1 + X_2 + ... + X_n, then
S_n is the position of a random walk after n steps: A step to the right is +1, a
step to the left is -1, so the sum of the first n steps is the position of the
walker at that time.
To plot the walk, type:
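A sketch using 1000 tosses (the number of steps is our choice):
n = 1000
x = sample(c(-1, 1), n, replace = TRUE)
walk = cumsum(x)
plot(walk, type = "l", xlab = "Step (n)", ylab = "Position (S_n)")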
The results of random walk analysis have been applied to computer
science, physics, climatology, ecology, economics and a number of other fields
as a fundamental model for random processes in time. For example, the path
traced by a molecule as it travels in a liquid or a gas, the search path of a
foraging animal, the price of a fluctuating stock and the financial status of a
gambler can all be modeled as random walks.
To animate the random walk, we define a function that draws the graph up to time
x. For this we use the indexing function to perform the selection on the
random walk, and then plot the graph with a suitable pause.
plot(1:k, walk[support<=k], xlim=c(1,n),
  ylim=range(walk), type="l")
To run it, type:
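A complete version of the function sketched above together with a loop that steps through time; the pause length and step size are arbitrary choices:
support = 1:n
plot.walk = function(k) {
  plot(1:k, walk[support <= k], xlim = c(1, n),
    ylim = range(walk), type = "l")
}
for (k in seq(2, n, by = 10)) { plot.walk(k); Sys.sleep(0.05) }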
Try running it a few times to acquire some intuition
about what a random walk looks like.