“Statistics is the grammar of science.”—Karl Pearson
data files: US.txt annual.RData bi.csv
packages: xtable, pscl, plotmo, earth, ggplot2, reshape
third party: roc.R
Here we focus on statistical models for understanding and predicting hurricane climate. We begin with exploratory analysis and then show you how to model counts with Poisson regression. Issues of model fit, interpretation, and prediction are considered in turn.
The topic of how to assess forecast skill is also examined including how to perform cross validation. Alternatives to the Poisson regression model are considered. Logistic regession and receiver operating characteristics are also covered.
The data set US.txt contains a list of tropical cyclone counts by year. The counts indicate the number of hurricanes hitting in the United States (excluding Hawaii). Input the data, save them as a data frame object, and print out the first six lines by typing
con = "http://www.hurricaneclimate.com/storage/chapter-7/US.txt"
H = read.table(con, header = TRUE)
head(H)
## Year All MUS G FL E
## 1 1851 1 1 0 1 0
## 2 1852 3 1 1 2 0
## 3 1853 0 0 0 0 0
## 4 1854 2 1 1 0 1
## 5 1855 1 1 1 0 0
## 6 1856 2 1 1 1 0
The columns include year Year, number of U.S. hurricanes All, number of major U.S. hurricanes MUS, number of U.S. Gulf coast hurricanes G, number of Florida hurricanes FL, and number of East coast hurricanes E.
To make subsequent analyses easier save the number of years in the record as n and the average number hurricanes per year as rate.
n = length(H$Year)
rate = mean(H$All)
n
## [1] 160
rate
## [1] 1.694
The average number of U.S. hurricanes is 1.69 per year over these 160 years.
Show your data. Here you plot the time series and distribution of the annual counts. Together the two plots provide a nice summary of the information in your data relevant to any modeling effort.
layout(matrix(c(1, 2), 1, 2, byrow = TRUE), widths = c(0.57, 0.43))
plot(H$Year, H$All, type = "h", xlab = "Year", ylab = "Hurricane count")
grid()
points(H$Year, H$All, type = "h")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
barplot(table(H$All), xlab = "Hurricane Count", ylab = "Number of years", main = "")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)
The year-to-year variability and the distribution of counts appear to be consistent with a random count process. There are 34 years without a hurricane and one year (1886) with seven hurricanes. The number of years with a particular hurricane count provides a histogram.
The shape of the histogram suggests a Poisson distribution might be a good description for these data. The density function of the Poisson distribution shows the probability \( p \) of obtaining a count \( x \) when the mean count (rate) is \( \lambda \) is given by \[ p(x) = \frac{\mathrm{e}^{-\lambda} \lambda^x}{x!}, \] where \( \mathrm{e} \) is the exponential function and ! is the factorial symbol. The equation indicates that probability of no events is \( p(0) = \mathrm{e}^{-\lambda} \).
With \( \lambda \) =1.69 hurricanes per year, the probability of no hurricanes in a random year is
exp(-rate)
## [1] 0.1838
The probability of at least one hurricane is 1 minus this probability.
Using the dpois() function you can determine the probability for any number of hurricanes. For example, to determine the probability of observing exactly one hurricane when the rate is 1.69 hurricanes per year, type
dpois(x = 1, lambda = rate)
## [1] 0.3114
Or the probability of five hurricanes expressed in percent is
dpois(5, rate) * 100
## [1] 2.135
Recall you can leave off the argument names in the function if the argument values are placed in the default order.
The argument default order can be found by placing a question mark in front of the function name and leaving off the parentheses. This brings up the function's help page.
To answer the question, 'what is the probability of two or fewer hurricanes?', you use the cumulative probability function ppois() as follows
ppois(q = 2, lambda = rate)
## [1] 0.7589
Then to answer the question, 'what is the probability of more than two hurricanes?', you add the argument lower.tail=FALSE.
ppois(q = 2, lambda = rate, lower.tail = FALSE)
## [1] 0.2411
The Poisson distribution has the property that the variance is equal to the mean. Thus the ratio of the variance to the mean is one. You compute this ratio from your data by typing
round(var(H$All)/rate, 2)
## [1] 1.24
This says that the variance of hurricane counts is 24% larger than the mean. Is this unusual for a Poisson distribution?
You check by performing a Monte Carlo (MC) simulation experiment. A MC simulation relies on repeated random sampling. A single random sample of size \( n \) from a Poisson distribution with a rate equal to 1.5 is obtained by typing
rpois(n = 5, lambda = 1.5)
## [1] 1 1 1 1 3
Now repeat this \( m=1000 \) times. Let \( n \) be the number of years in your hurricane record and \( \lambda \) be the rate. For each sample you compute the ratio of the variance to the mean.
set.seed(3042)
ratio = numeric()
m = 1000
for (i in 1:m) {
h = rpois(n = n, lambda = rate)
ratio[i] = var(h)/mean(h)
}
The vector ratio contains 1000 values of the ratio. To help answer the 'Is this unusual?' question, you determine the proportion of ratios greater than 1.24
sum(ratio > var(H$All)/rate)/m
## [1] 0.028
Only 2.8% of the ratios are larger, so the answer from your MC experiment is 'yes,' the variability in hurricane counts is higher than you would expect (unusual) from a Poisson distribution with a constant rate.
This indicates that the rate varies over time. Although you can compute a long-term average, some years have a higher rate than others.
The variation in the rate is due to things like El Nino. So you expect more variance (extra dispersion) in counts relative to a constant rate (homogeneous Poisson) distribution.
This is the basis behind seasonal forecasts. Note that a variation in the annual rate is not obvious from looking at the variation in counts. Even with a constant rate the counts will vary.
You modify your MC simulation using the gamma distribution for the rate and then examine the ratio of variance to the mean from a set of Poisson counts with the variable rate. The gamma distribution describes the variability in the rate using the shape and scale parameters.
The mean of the gamma distribution is the shape times the scale. You specify the shape to be 5.6 and the scale to be 0.3 so the product matches closely the long-term average count. You could, of course, choose other values that produce the same average.
Now your simulation first generates 1000 random gamma values and then for each gamma 160 years of hurricane counts are generated.
ratio = numeric()
set.seed(3042)
m = 1000
for (i in 1:m) {
h = rpois(n = n, lambda = rgamma(m, shape = 5.6, scale = 0.3))
ratio[i] = var(h)/mean(h)
}
sum(ratio > var(H$All)/rate)/m
## [1] 0.616
In this case you find 61.6% of the ratios are larger, so you conclude that the observed hurricane counts are more consistent with a variable rate (inhomogeneous) Poisson model.
The examples demonstrate an important use of statistics; to simulate data that have the same characteristics as your observations.
The figure below is a plot of the observed hurricane record over the 160-year period together with plots from three simulated records of the same length and having the same over-dispersed Poisson variation as the observed record. Simulated records provide a way to test hypotheses about natural variability in hurricane climate.
par(mfrow = c(2, 2))
plot(H$Year, H$All, type = "h", ylim = c(0, 10), xlab = "Year", ylab = "Hurricane count")
grid()
points(H$Year, H$All, type = "h")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
hh = rpois(n, lambda = rgamma(n, 5.6, scale = 0.3))
plot(H$Year, hh, type = "h", ylim = c(0, 10), xlab = "Year", ylab = "Hurricane count")
grid()
points(H$Year, hh, type = "h")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)
hh = rpois(n, lambda = rgamma(n, 5.6, scale = 0.3))
plot(H$Year, hh, type = "h", ylim = c(0, 10), xlab = "Year", ylab = "Hurricane count")
grid()
points(H$Year, hh, type = "h")
mtext("c", side = 3, line = 1, adj = 0, cex = 1.1)
hh = rpois(n, lambda = rgamma(n, 5.6, scale = 0.3))
plot(H$Year, hh, type = "h", ylim = c(0, 10), xlab = "Year", ylab = "Hurricane count")
grid()
points(H$Year, hh, type = "h")
mtext("d", side = 3, line = 1, adj = 0, cex = 1.1)
Summary characteristics of a 100 years of hurricanes at a coastal location may be of little value, but running a sediment transport model at that location with a large number of simulated hurricane counts will provide an assessment of the uncertainty in sediment movement caused by natural variation in hurricane frequency.
The parameter of interest is the annual rate. Given the rate, you have a probability distribution for any possible hurricane count. You noted the observed counts are consistent with a Poisson model having a variable rate. But where does this variability come from?
On the annual time scale to a first order high ocean heat content and cold upper air temperature provide the fuel for a hurricane, a calm atmosphere allows a hurricane to intensify, and the position and strength of the subtropical high pressure steers a hurricane that does form.
Thus, hurricane activity responds to changes in climate conditions that affect these factors including sea-surface temperature (SST) as indicator of oceanic heat content, sunspot number as an indicator of upper air temperature, El Nino-Southern Oscillation (ENSO) as indicator of wind shear, and the North Atlantic Oscillation as an indicator of steering flow.
SST provides an indication of the thermodynamic environment as do sunspots. An increase in solar UV radiation during periods of strong solar activity will have a suppressing affect on tropical cyclone intensity as the air above the hurricane will warm through absorption of radiation by ozone and modulated by dynamic effects in the stratosphere (Elsner and Jagger (2008).
ENSO is characterized by basin-scale fluctuations in sea-level pressure across the equatorial Pacific Ocean. The Southern Oscillation Index (SOI) is defined as the normalized sea-level pressure difference between Tahiti and Darwin, and values are available back through the middle 19th century. The SOI is strongly anti-correlated with equatorial Pacific SSTs so that an El Nino warming event is associated with negative SOI values.
The NAO is characterized by fluctuations in sea-level pressure (SLP) differences. Monthly values are an indicator of the strength and/or position of the subtropical Bermuda High. The relationship might result from a teleconnection between the midlatitudes and tropics whereby a below normal NAO during the spring leads to dry conditions over the continents and to a tendency for greater summer/fall middle tropospheric ridging (enhancing the dry conditions).
Ridging over the eastern and western sides of the North Atlantic basin tends to keep the middle tropospheric trough, responsible for hurricane recurvature, farther to the north during the peak of the season (Elsner and Jagger 2006).
With the exception of the NAO index the monthly values are averages over the three months of August through October. The NAO index is averaged over the two months of May and June.
Consider the relationship between hurricane frequency and a single environmental variable (covariate). Scatter plots are not very useful as many different covariate values exist for a given count. Instead it is better to plot a summary of the covariate distribution for each count.
The five-number summary provides information about the median, the range, and the quartile values of a distribution. So for example you compare the five number summary of the NAO during years with no hurricanes and during years with three hurricanes by typing
load("annual.RData")
nao0 = annual$nao[H$All == 0]
nao3 = annual$nao[H$All == 3]
fivenum(nao0)
## [1] -2.030 -0.765 -0.165 0.600 1.725
fivenum(nao3)
## [1] -2.655 -1.315 -0.685 -0.085 1.210
For each quantile of the five number summary, the NAO value is lower when there are three hurricanes compared to when there are no hurricanes. A plot showing the five number summary values for all years and all covariates. Note that the covariate is by convention plotted on the horizontal axis in a scatter plot, so you make the lines horizontal.
par(mfrow = c(2, 2))
plot(range(annual$sst, na.rm = TRUE), c(0, 7), type = "n", ylab = "Hurricane count",
xlab = "SST [C]")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
for (i in 0:7) {
points(fivenum(annual$sst[H$All == i])[3], i, pch = 19)
lines(c(fivenum(annual$sst[H$All == i])[1], fivenum(annual$sst[H$All ==
i])[2]), c(i, i))
lines(c(fivenum(annual$sst[H$All == i])[4], fivenum(annual$sst[H$All ==
i])[5]), c(i, i))
}
plot(range(annual$soi, na.rm = TRUE), c(0, 7), type = "n", ylab = "Hurricane count",
xlab = "SOI [s.d.]")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)
for (i in 0:7) {
points(fivenum(annual$soi[H$All == i])[3], i, pch = 19)
lines(c(fivenum(annual$soi[H$All == i])[1], fivenum(annual$soi[H$All ==
i])[2]), c(i, i))
lines(c(fivenum(annual$soi[H$All == i])[4], fivenum(annual$soi[H$All ==
i])[5]), c(i, i))
}
plot(range(annual$nao, na.rm = TRUE), c(0, 7), type = "n", ylab = "Hurricane count",
xlab = "NAO [s.d.]")
mtext("c", side = 3, line = 1, adj = 0, cex = 1.1)
for (i in 0:7) {
points(fivenum(annual$nao[H$All == i])[3], i, pch = 19)
lines(c(fivenum(annual$nao[H$All == i])[1], fivenum(annual$nao[H$All ==
i])[2]), c(i, i))
lines(c(fivenum(annual$nao[H$All == i])[4], fivenum(annual$nao[H$All ==
i])[5]), c(i, i))
}
plot(range(annual$ssn, na.rm = TRUE), c(0, 7), type = "n", ylab = "Hurricane count",
xlab = "Sunspot number")
mtext("d", side = 3, line = 1, adj = 0, cex = 1.1)
for (i in 0:7) {
points(fivenum(annual$ssn[H$All == i])[3], i, pch = 19)
lines(c(fivenum(annual$ssn[H$All == i])[1], fivenum(annual$ssn[H$All ==
i])[2]), c(i, i))
lines(c(fivenum(annual$ssn[H$All == i])[4], fivenum(annual$ssn[H$All ==
i])[5]), c(i, i))
}
The plots show the SOI and NAO are likely important in statistically explaining hurricane counts as there appears to be a fairly systematic variation in counts across their range of values. The variation is less clear with SST and sunspot number.
Note that the bivariate relationships do not necessarily tell the entire story. Covariates may contribute redundant information to the response variable so they all might be significant in a multivariate model.
The model of choice for count data is Poisson regression. Poisson regression assumes the response variable has a Poisson distribution and the logarithm of the expected value of the response variable is modeled with a linear combination of explanatory variables (log-linear model).
The linear regression model is not appropriate for count data. To illustrate, here you regress U.S. hurricane counts on the four explanatory variables (covariates) described above. You then use the model to make predictions specifying the SOI and NAO at three standard deviation departures from the average, a large sunspot number, and an average SST value.
To make thing a bit simpler, you first create a data frame object by typing
df = data.frame(All = H$All, SOI = annual$soi, NAO = annual$nao, SST = annual$sst,
SSN = annual$ssn)
df = df[-(1:15), ]
Here the data frame object df has columns with labels All, SOI, NAO, SST, and SSN corresponding to the response variable U.S. hurricane counts and the four explanatory variables. You remove the first fifteen years because of missing SOI values.
You then create a linear regression model object using the lm() function specifying the response and covariates accordingly.
lrm = lm(All ~ SOI + NAO + SST + SSN, data = df)
Your model is saved in lrm.
Next you use the predict() method on the model object together with explanatory values specified using the newdata argument. The names must match those used in your model object and each explanatory variable must have a value.
predict(lrm, newdata = data.frame(SOI = -3, NAO = 3, SST = 0, SSN = 250))
## 1
## -0.3181
The prediction results in a negative number that is not a count. It indicates the climate conditions are unfavorable for hurricanes, which is true but the number has no physical meaning. This is a problem.
A Poisson regression model that specifies the logarithm of the annual hurricane rate is an alternative. The assumption is that the hurricanes are independent in the sense that the arrival of one hurricane will not make another one more or less likely, but the rate of hurricanes varies from year to year due to the covariates.
The Poisson regression model is expressed as \[ \log(\lambda) = \beta_0+\beta_1 x_1 + \ldots +\beta_p x_p \] Here there are \( p \) covariates (indicated by the \( x_i \)'s) and \( p+1 \) parameters (\( \beta_i \)'s).
The model uses the logarithm of the rate as the response variable, but it is linear in the regression structure. It is not the same as a linear regression on the logarithm of counts. The model coefficients are determined by the method of maximum likelihood.
It is important to understand that with a Poisson regression you cannot explain all the variation in the observed counts; there will always be unexplainable variation due to the stochastic nature of the process.
Thus even if the model precisely predicts the rate of hurricanes, the set of predicted counts will have a degree of variability that cannot be reduced by the model (aleatory uncertainty). Conversely, if you think you can explain all (or even most) of the variability in the counts, then Poisson regression is not the appropriate model.
The method of maximum likelihood is employed in the glm() function to determine the model coefficients. The Poisson regression model is a type of generalized linear model (GLM) in which the logarithm of the rate of occurrence is a linear function of the covariates (predictors).
To fit a Poisson regression model to U.S. hurricanes and save the model as an object, type
prm = glm(All ~ SOI + NAO + SST + SSN, data = df, family = "poisson")
The model formula is identical to what you used to fit the linear regression model above. The difference is now you specify the argument family=“poisson”.
You examine model coefficients by typing
summary(prm)
##
## Call:
## glm(formula = All ~ SOI + NAO + SST + SSN, family = "poisson",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.853 -0.898 -0.138 0.603 2.472
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.59529 0.10334 5.76 8.4e-09 ***
## SOI 0.06186 0.02132 2.90 0.0037 **
## NAO -0.16660 0.06443 -2.59 0.0097 **
## SST 0.22897 0.25529 0.90 0.3698
## SSN -0.00231 0.00137 -1.68 0.0928 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 197.89 on 144 degrees of freedom
## Residual deviance: 174.81 on 140 degrees of freedom
## AIC: 479.6
##
## Number of Fisher Scoring iterations: 5
You can output the results directly to a latex table.
require(xtable)
## Loading required package: xtable
tbl = xtable(summary(prm), label = "tab:modelcoef", caption = "Coefficients of the Poisson regression model.")
print(tbl, math.style.negative = TRUE, caption.placement = "top")
## % latex table generated in R 2.15.2 by xtable 1.7-0 package
## % Thu Mar 28 13:52:58 2013
## \begin{table}[ht]
## \begin{center}
## \caption{Coefficients of the Poisson regression model.}
## \label{tab:modelcoef}
## \begin{tabular}{rrrrr}
## \hline
## & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
## \hline
## (Intercept) & 0.5953 & 0.1033 & 5.76 & 0.0000 \\
## SOI & 0.0619 & 0.0213 & 2.90 & 0.0037 \\
## NAO & $-$0.1666 & 0.0644 & $-$2.59 & 0.0097 \\
## SST & 0.2290 & 0.2553 & 0.90 & 0.3698 \\
## SSN & $-$0.0023 & 0.0014 & $-$1.68 & 0.0928 \\
## \hline
## \end{tabular}
## \end{center}
## \end{table}
As anticipated from the bivariate relationships, the SOI and SST variables are positively related to the rate of U.S. hurricanes and the NAO and sunspot number are negatively related.
You can see that the coefficient on SST is positive, but not statistically significant. The SOI and NAO have coefficients large enought to provide convincing evidence against the null hypothesis. The coefficient on the SSN provides suggestive, but inconclusive evidence against the null.
Statistical significance is based on a null hypothesis that the coefficient value is zero. The ratio of the estimated coefficient to its standard error (\( z \)-value) has an approximate standard normal distribution assuming the null is true. The probability of finding a \( z \)-value this extreme or more so is your \( p \)-value. The smaller the \( p \)-value, the less support there is for the null hypothesis given your data and model.
You use the plotmo() function from the plotmo package (Milborrow 2012) to plot your model's response when varying one covariate while holding the other covariates constant at their median values.
require(plotmo)
plotmo(prm)
##
## grid: SOI NAO SST SSN
## -0.2667 -0.43 -0.05507 48.1
As SOI and SST increase so does hurricane rate. As NAO and SSN increase the rate decreases. The curvature arises from the fact that the covariates are related to the counts through the logarithm of the rate.
The confidence bands are based on pointwise \( \pm \) 2 standard errors. Note the relatively large width for SOI values above 5 s.d. and for NAO values below -2 s.d.
Interpretation of the Poisson regression coefficients is different than the interpretation of linear regression coefficients.
The coefficient value on the SOI indicates that for every one standard deviation (s.d.) increase in the SOI, the difference in the logarithm of hurricane rates is 0.062.
Since there are other covariates, you must add 'given that the other covariates in the model are held constant.'
But what does the difference in the logarithm mean? Note that \[ \log \mathrm{A} - \log \mathrm{B} = \log(\frac{\mathrm{A}}{\mathrm{B}}) \] so exponentiating the SOI coefficient value provides a ratio of the hurricane rates for a unit change in the SOI. You do this by typing
exp(summary(prm)$coefficients[2, 1])
## [1] 1.064
and find that for every one s.d. increase in SOI, the hurricane rate increases by a factor of 1.06 or 6%.
Similarly, since the NAO coefficient value is -0.167 you find that for every one s.d. increase in the NAO, the hurricane rate decreases by a factor of 15%.
Given the model coefficients obtained by the method of maximum likelihood using the glm() function and saved as a model object, you make predictions using the predict() method. For comparison above here you predict the rate of hurricanes given the coefficient values used above for the linear regression model.
predict(prm, newdata = data.frame(SOI = -3, NAO = 3, SST = 0, SSN = 250), type = "response")
## 1
## 0.5135
The argument type=“response” gives the prediction in terms of the mean response (hurricane rate). By default type=“link” which results in a prediction in terms of the link function (here the logarithm of the mean response).
Recall the linear regression model gave a prediction that was physically unrealistic. Here the predicted value indicates a small hurricane rate as you would expect given the covariate values, but the rate is a realistic non-negative number.
The predicted rate together with the equation for the Poisson distribution provides a probability for each possible count.
To see this you create two bar plots, one for a forecast of hurricanes under unfavorable conditions and another for a forecast of hurricanes under favorable conditions. First you save the predicted rate for values of the covariates that are favorable and unfavorable for hurricanes.
You then create a vector of counts from zero to six that is used as the set of quantiles for the dpois() function and as the names argument in the barplot() function. The plotting parameters are set using the par() function.
To make it easier to compare the probability distributions, limits on the vertical axis (ylim) are set the same.
fav = predict(prm, newdata = data.frame(SOI = 2, NAO = -2, SST = 0, SSN = 50),
type = "response")
ufa = predict(prm, newdata = data.frame(SOI = -2, NAO = 2, SST = 0, SSN = 200),
type = "response")
h = 0:6
par(mfrow = c(1, 2), las = 1)
barplot(dpois(x = h, lambda = ufa), ylim = c(0, 0.5), names.arg = h, xlab = "Number of Hurricanes",
ylab = "Probability")
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
barplot(dpois(x = h, lambda = fav), ylim = c(0, 0.5), names.arg = h, xlab = "Number of Hurricanes",
ylab = "Probability")
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)
The forecast probability of two or more hurricanes is 72% in years with favorable conditions, but decreases to 16% in years with unfavorable conditions. The probability of no hurricanes during years with favorable conditions is 8%, which compares with a probability of 48% during years with unfavorable conditions. Using the data, the model translates swings in climate to landfall probabilities.
Forecast skill refers to how well your predictions match the observations. There are several ways to quantify this match. Here you consider three of the most common, the mean absolute error (MAE), the mean squared error (MSE) and the correlation coefficient ®.
Let \( \lambda_i \) be the predicted rate for year \( i \) and \( o_i \) be the corresponding observed count for that year. Then the three measures of skill over \( n \) years are defined by \[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^n \left| \lambda_i-o_i\right| \\ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n \left( \lambda_i-o_i\right)^2 \\ \mathrm{r} = \frac{\sum (\lambda_i - \bar \lambda) (o_i - \bar o)}{\sqrt{\sum (\lambda_i - \bar \lambda)^2 \sum (o_i - \bar o)^2}} \] You obtain the predicted rate for each year in the record (\( \lambda_i \)) by typing
prm = glm(All ~ SOI + NAO + SSN, data = df, family = "poisson")
pr = predict(prm, type = "response")
You first create a new model removing the insignificant SST covariate. Since each prediction is made for a year with a known hurricane count, it is referred to as a 'hindcast.' The word 'forecast' is reserved for a prediction made for a year where the hurricane count is unknown (typically in the future).
The Poisson regression hindcasts are given in terms of rates while the observations are counts. So instead of using a rate in the first two formulae above you use the probability distribution of observing \( j= 0, 1, \ldots, \infty \) hurricanes. A probabilistic form of the above formulae are \[ \mathrm{MAEp} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=0}^{\infty} \mathrm{dpois}(j,\lambda_i) \cdot \left|j-o_i(j)\right| \\ \mathrm{MSEp} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=0}^{\infty} \mathrm{dpois}(j,\lambda_i) \cdot \left(j-o_i(j)\right)^2 \\ = \mathrm{MSE} + \bar \lambda \] where \( \mathrm{dpois}(j, h_i) \) is the probability of \( j \) hurricanes in the \( i \) th year given the predicted rate \( \lambda_i \), and \( o_i(j) \) is one when \( j \) is the observed number of hurricanes during the $i$th year and zero, otherwise. Note there is no corresponding probabilistic form to the correlation as a measure of forecast skill.
A prediction model is deemed useful if the skill level exceeds the level of a naive reference model. The percentage above the skill obtained from a naive model is referred to useful skill. The naive model is typically climatology. To obtain the climatological rate you type
clim = glm(All ~ 1, data = df, family = "poisson")
cr = predict(clim, type = "response")
Here the term to the right of the tilde is a 1. The model predicts a single value representing the mean hurricane rate over the period of record. The value is the same for each year.
The table below shows skill metrics for your U.S. hurricane model and the percentage of useful skill relative to climatology. Note that the correlation is undefined for the climatological forecast.
j = 0:15
n = length(df$All)
poisson = c(mean(abs(df$All - pr)), mean((df$All - pr)^2), cor(df$All, pr),
sum(outer(j, pr, dpois) * outer(j, df$All, function(x, y) abs(x - y)))/n,
sum(outer(j, pr, dpois) * outer(j, df$All, function(x, y) (x - y)^2))/n)
climatology = c(mean(abs(df$All - cr)), mean((df$All - cr)^2), NA, sum(outer(j,
cr, dpois) * outer(j, df$All, function(x, y) abs(x - y)))/n, sum(outer(j,
cr, dpois) * outer(j, df$All, function(x, y) (x - y)^2))/n)
useful = (climatology - poisson)/climatology * 100
skilltable = data.frame(Poisson = poisson, Climatology = climatology, Useful = useful)
rownames(skilltable) = c("MAE", "MSE", "r", "MAEp", "MSEp")
tbl = xtable(skilltable, digits = 2, label = "tab:skillmetricsin", caption = "Forecast skill (in sample).")
print(tbl, math.style.negative = TRUE, caption.placement = "top")
## % latex table generated in R 2.15.2 by xtable 1.7-0 package
## % Thu Mar 28 13:52:58 2013
## \begin{table}[ht]
## \begin{center}
## \caption{Forecast skill (in sample).}
## \label{tab:skillmetricsin}
## \begin{tabular}{rrrr}
## \hline
## & Poisson & Climatology & Useful \\
## \hline
## MAE & 1.08 & 1.16 & 7.04 \\
## MSE & 1.93 & 2.18 & 11.24 \\
## r & 0.34 & & \\
## MAEp & 1.44 & 1.50 & 4.08 \\
## MSEp & 3.67 & 3.92 & 6.26 \\
## \hline
## \end{tabular}
## \end{center}
## \end{table}
prm.r = skilltable$Poisson[3]
The useful skill level is between 4.1 and 11.2% relative to climatology. While not high, it represents a significant improvement.
The above procedure results in an in-sample assessment of forecast skill. All years of data are used to estimate a single set of model coefficients with the model subsequently used to hindcast each year's hurricane activity. But how well will your model perform when making predictions of the future?
This question is best answered with an out-of-sample assessment of skill. An out-of-sample assessment (1) excludes a single year of observations, (2) determines the MLE coefficients of the Poisson regression model using observations from the remaining years, and (3) uses the model to predict the hurricane count for the year left out.
This is done \( n \) times, removing each year's worth of observations successively. The above skill metrics are then used on these out-of-sample predictions. The procedure is called 'cross validation' and where single cases are left out, it is called leave-one-out cross validation (LOOCV).
To perform LOOCV on your Poisson regression model you loop over all years using the index i. Within the loop you determine the model using all years except i (df[-i, ] in the data argument). You then make a prediction only for the year you left out (newdata=df[i, ]). Note that your climatology model is cross validated as well.
j = 0:15
n = length(df$All)
prx = numeric()
crx = numeric()
for (i in 1:n) {
prm = glm(All ~ SOI + NAO + SSN, data = df[-i, ], family = "poisson")
clm = glm(All ~ 1, data = df[-i, ], family = "poisson")
prx[i] = predict(prm, newdata = df[i, ], type = "r")
crx[i] = predict(clm, newdata = df[i, ], type = "r")
}
Skill assessment is done in the same way as for in-sample assessment. The results of the cross-validation assessment of model skill are give in the table below.
poisson = c(mean(abs(df$All - prx)), mean((df$All - prx)^2), cor(df$All, prx),
sum(outer(j, prx, dpois) * outer(j, df$All, function(x, y) abs(x - y)))/n,
sum(outer(j, prx, dpois) * outer(j, df$All, function(x, y) (x - y)^2))/n)
climatology = c(mean(abs(df$All - crx)), mean((df$All - crx)^2), NA, sum(outer(j,
crx, dpois) * outer(j, df$All, function(x, y) abs(x - y)))/n, sum(outer(j,
crx, dpois) * outer(j, df$All, function(x, y) (x - y)^2))/n)
useful = (climatology - poisson)/climatology * 100
skilltableCV = data.frame(Poisson = poisson, Climatology = climatology, Useful = useful)
rownames(skilltableCV) = c("MAE", "MSE", "r", "MAEp", "MSEp")
tbl = xtable(skilltableCV, digits = 2, label = "tab:skillmetricsout", caption = "Forecast skill (out of sample).")
print(tbl, math.style.negative = TRUE, caption.placement = "top")
## % latex table generated in R 2.15.2 by xtable 1.7-0 package
## % Thu Mar 28 13:53:00 2013
## \begin{table}[ht]
## \begin{center}
## \caption{Forecast skill (out of sample).}
## \label{tab:skillmetricsout}
## \begin{tabular}{rrrr}
## \hline
## & Poisson & Climatology & Useful \\
## \hline
## MAE & 1.11 & 1.16 & 4.91 \\
## MSE & 2.05 & 2.21 & 7.03 \\
## r & 0.26 & & \\
## MAEp & 1.46 & 1.51 & 2.81 \\
## MSEp & 3.80 & 3.95 & 3.78 \\
## \hline
## \end{tabular}
## \end{center}
## \end{table}
Out-of-sample skill levels are lower. However, this is an estimate of the average skill the model will have when used to make actual forecasts. Always show out-of-sample skill if you intend to use your model to predict the future.
The difference in percentage of usefulness between in-sample and out-of-sample skill is a measure of the over-fit in your model. Over-fit arises when your model interprets random fluctuations as signal. Cross validation helps you protect yourself against being fooled by this type of randomness.
Poisson regression specifies a linear relationship between your covariates and the logarithm of the hurricane rate. Linearity in the regression structure can be restrictive if the influence of a covariate changes over its range. Multivariate adaptive regression splines (MARS) is a form of regression introduced by Friedman (1991) that allows for such nonlinearities.
MARS builds models of the form \[ \hat f(x) = \sum_{i=1}^k c_i B_i(x) \] where \( B_i(x) \) is a basis function and \( c_i \) is a constant coefficient. The model is thus a weighted sum of the \( k \) basis functions. A basis function takes one of three forms: Either a constant representing the intercept term, a hinge function of the form max(\( 0, x - a \)), where \( a \) is a constant representing the knot for the hinge function, or a product of two or more hinge functions to allow the basis function the ability to handle interaction between two or more covariates. A hinge function is zero for part of its range, so it partitions your multivariate data into disjoint regions.
The earth() function in the earth package (Milborrow 2011) provides functionality for MARS. The syntax is similar to other models in R. Here you create a model using MARS for your hurricane counts and the same environmental covariates by typing
require(earth)
mars = earth(All ~ SOI + NAO + SST + SSN, data = df, glm = list(family = "poisson"))
A summary method on the model object indicates that only the SOI and NAO are selected as important in explaining hurricane rates. The correlation between the predicted rate and the observed counts is obtained by taking the square root of the R-squared value.
sqrt(mars$rsq)
## [1] 0.4694
This value exceeds the correlation from your Poisson regression by 39.7% suggesting you might have found a better prediction model.
The partial dependence plot shows the hinge functions for the two selected covariates.
source("plotmo.R")
par(las = 1, c(3.2, 0.4, 0), tcl = -0.3)
plotmo(mars, degree1 = c("SOI", "NAO"), lwd.degree1 = 2, caption = "", xlab = c("SOI [s.d.]",
"NAO [s.d.]"), main = "", ylab = "Annual rate [hur/yr]", cex.lab = 0.8)
##
## grid: SOI NAO SST SSN
## -0.2667 -0.43 -0.05507 48.1
The knots on the SOI are located at about $-$2 and $+$4 s.d. There is no relationship between the SOI and hurricane counts for the lowest values of SOI and there is a sharp inverse relationship for the largest values.
It is tempting to over interpret these graphs. The SOI-hurricane relationship is for median NAO values only. The single knot on the NAO indicates no relationship between the NAO and hurricane counts for values less than about $-$0.5 s.d. This applies only for median SOI values. There are no standard errors from which to obtain confidence bands.
Before making forecasts you use cross validation on your MARS to get an estimate of the correlation ® between observed and predicted using independent data.
You do this by specifying the number of cross validations with the nfold argument. Cross validation is done if the argument is greater than one.
The function earth() builds nfold cross-validated models. For each fold it builds a model with in-sample data and then uses the model to compute the R-squared value from predictions made on the out-of-sample data.
For instance, with nfold=2 the number of years in the in-sample and out-of-sample is roughly the same. The choice of years is chosen randomly so you set a seed to allow exact replication of your results.
Here you set nfold to 5 as a compromise between having enough data to both build the model and make predictions with it. You stabilize the variance further by specifying ncross to allow 40 different nfold cross validations.
set.seed(3042)
marsCV = earth(All ~ SOI + NAO, data = df, nfold = 5, ncross = 40, glm = list(family = "poisson"))
The results are saved in your marsCV object in column cv.rsq.tab. The last row gives the mean R-squared value that provides an estimate of the average skill your model will have when it is used to make actual forecasts.
The square root of that value is the correlation, obtained given by typing
rn = dim(marsCV$cv.rsq.tab)[1]
mars.r = sqrt(marsCV$cv.rsq.tab[rn, ][1])
mars.r
## All
## 0.2914
The mean r value is 11% higher than the r value from the HOOCV of your Poisson regression model. This is an improvement but well below that estimated from your in-sample skill.
The Poisson regression model is a good place to start when working with count data but it might not be good enough when data exhibit over-dispersion or when there are a large number of zero values.
Consider the question of whether your Poisson regression model of hurricane counts is adequate. You examine model adequacy using the residual deviance. The residual deviance is -2 times the log-likelihood ratio of a model without covariates compared to your model that has covariates.
The residual deviance along with the residual degrees of freedom are available from the summary method on your glm object.
prm = glm(All ~ SOI + NAO + SSN, data = df, family = "poisson")
s = summary(prm)
rd = s$deviance
dof = s$df.residual
Under the null hypothesis that your model is adequate the residual deviance has a \( \chi^2 \) distribution with degrees of freedom equal to the residual degrees of freedom.
Caution, as this reversed from the typical case where the null is the opposite of what you hope for. To obtain the \( p \)-value for a test of model adequacy type
pchisq(rd, dof, lower.tail = FALSE)
## [1] 0.02547
The residual deviance is 175.61 on 141 degrees of freedom resulting in a \( p \)-value of 0.0255. This provides you with evidence that something is missing.
The problem may be that hurricanes tend to arrive in clusters even after taking into account the covariates that influence hurricanes rates from year to year. This clustering produces over-dispersion in observed counts.
Another problem is when count data have many zeros. This is typical when there are two processes at work, one determining whether there is at least one event, and one determining how many events.
An example is the occurrence of cloud-to-ground lightning strikes inside a tropical cyclone. There will be many more hours with zero strikes due to convective processes that are different than those that produce one or more strikes.
These kinds of data can be handled with zero-inflated models.
Zero-inflated count models combine a point mass at zero and a count distribution. The gives you two sources of zeros: one from the point mass and the other from the count distribution. Usually the count model is a Poisson regression and the point mass is a binomial regression.
The zeroinfl() function in the pscl package Zeileis et al. (2008) is used to fit a zero-inflated model using the method of maximum likelihood.
The formula argument mainly describes the count data model, i.e., y ~ x1 + x2 specifies a count data regression where all zero counts have the same probability of belonging to the zero component.
This is equivalent to the model y ~ x1 + x2 | 1, making it more explicit that the zero-inflated model only has an intercept.
Additional predictors can be added so that not all zeros have the same probability of belonging to the point mass component or to the count component. A typical formula is y ~ x1 + x2 | z1 + z2. The covariates in the zero and the count component can be the same.
For example, to model your U.S. hurricane counts where the count model uses all four covariates and where the binomial model uses only the SST variable, type
require(pscl)
zim = zeroinfl(All ~ SOI + NAO + SST + SSN | SST, data = df)
The model syntax includes a vertical bar to separate the covariates between the two model components. The returned object is of class zeroinfl and is similar to the object of class glm.
The object contains list for the coefficients and terms of each model component. To get a summary of the model object, type
summary(zim)
##
## Call:
## zeroinfl(formula = All ~ SOI + NAO + SST + SSN | SST, data = df)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.492 -0.774 -0.127 0.583 3.061
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64087 0.10593 6.05 1.5e-09 ***
## SOI 0.06920 0.02236 3.09 0.0020 **
## NAO -0.16928 0.06554 -2.58 0.0098 **
## SST 0.57178 0.28547 2.00 0.0452 *
## SSN -0.00239 0.00143 -1.67 0.0942 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.44 2.03 -2.19 0.029 *
## SST 6.80 3.89 1.75 0.080 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 16
## Log-likelihood: -231 on 7 Df
Results show that the four covariates have a statistically significant influence on the number of U.S. hurricanes and that SST has a significant relationship with whether or not there will be at least one hurricane.
Note that the sign on the SST coefficient is positive in the zero-inflation component indicating that higher SST is associated with more years without hurricanes. The sign is also positive in the count component indicating that, given at least one hurricane, higher SST is associated withe a higher probability of two or more hurricanes.
at = 0:30
n = length(df$All)
len.at = length(at)
zrx = numeric(n)
zrxprob = matrix(0, len.at, n)
crx = numeric(n)
crxprob = matrix(0, len.at, n)
for (i in 1:n) {
zim = zeroinfl(All ~ SOI + NAO + SST + SSN | SST, data = df[-i, ], EM = TRUE)
clm = zeroinfl(All ~ 1, data = df[-i, ], EM = TRUE)
new = df[i, ]
zrx[i] = predict(zim, newdata = new, type = "response")
zrxprob[, i] = predict(zim, newdata = new, type = "prob", at = at)
crx[i] = predict(clm, newdata = new, type = "response")
crxprob[, i] = predict(clm, newdata = new, type = "prob", at = at)
}
zeroin = c(mean(abs(df$All - zrx)), mean((df$All - zrx)^2), cor(df$All, zrx),
sum(zrxprob * outer(at, df$All, function(x, y) abs(x - y)))/n, sum(zrxprob *
outer(at, df$All, function(x, y) (x - y)^2))/n)
climatology = c(mean(abs(df$All - crx)), mean((df$All - crx)^2), NA, sum(crxprob *
outer(at, df$All, function(x, y) abs(x - y)))/n, sum(crxprob * outer(at,
df$All, function(x, y) (x - y)^2))/n)
useful = (climatology - zeroin)/climatology * 100
skilltableCV2 = data.frame(ZeroIn = zeroin, Climatology = climatology, Useful = useful)
rownames(skilltableCV2) = c("MAE", "MSE", "r", "MAEp", "MSEp")
A cross-validation exercise indicates that the zero-inflated model performs slightly worse than the Poisson model.
You can remove the Poisson assumption all together by employing a machine-learning algorithm that searches your data to find patterns related to the annual counts. This is called 'data mining.'
A regression tree is a type of machine learning algorithm that outputs a series of decisions with each decision leading to a value of the response or to another decision.
If the value of the NAO is less than $-$1 s.d. for example, then the response is two hurricanes. If it is greater, then is the SOI greater than 0.5 s.d. and so on.
A single tree will capture the relationships between annual counts and your predictors.
To see how this works, import the annual basin-wide hurricane data and subset on years since 1950. Create a data frame containing only the basin-wide hurricane counts and SOI and SST as the two predictors.
load("annual.RData")
dat = subset(annual, Year >= 1950)
df = data.frame(H = dat$B.1, SOI = dat$soi, SST = dat$sst)
Then using the tree() function from the tree package (Ripley 2011), type
require(tree)
rt = tree(H ~ SOI + SST, data = df)
The model syntax is the same as before with the response variable to the left of the tilde and the covariates to the right. To plot the regression tree, type
plot(rt)
text(rt)
Instead of interpreting the parameter values from a table of coefficients you interpret a regression tree from an upside-down tree-like diagram. You start at the top.
The first branch is a split on the your SST variable at a value of 0.33. The split is a rule. Is the value of SST less than 0.33$\circ$C? If yes, branch to the left; if no, branch to the right.
All splits work this way. Following on the right, the next split is on SOI. If SOI is greater than 0.12 s.d., then the mean value of all years under these conditions is 10.8 hur/yr. This is the end of the branch (leaf).
You check this by typing
mean(df$H[df$SST >= 0.33 & df$SOI > 0.12])
## [1] 10.8
The model is fit using binary recursive partitioning. Splits are made along coordinate axes of SST and SOI so that on any branch, a split is chosen that maximally distinguishes the hurricane counts.
Splitting continues until the variables cannot be split or there are too few years (less than 6 by default). Here SST is the variable explaining the most variation in the counts so it gets selected first.
Again, the value at which the split occurs is based on maximizing the difference in counts between the two subsets. The tree has five branches.
In general the key questions are: which variables are best to use and which value gives the best split. The choice of variables is similar to the forward selection procedure of stepwise regression.
A prediction is made by determining which leaf is reached based on the values of your predictors. To determine the mean number of hurricanes when SOI is $-\( 2 s.d. and SST is 0.2 \)\circ$C, you use the predict() method and type
predict(rt, data.frame(SOI = -2, SST = 0.2))
## 1
## 7.348
The predicted value depends on the tree. The tree depends on what years are used to grow it. For example, regrow the tree by leaving the last year out and make a prediction using the same two predictor values.
rt2 = tree(H ~ SOI + SST, data = df[-61, ])
predict(rt2, data.frame(SOI = -2, SST = 0.2))
## 1
## 5.714
Results are different. Which prediction do you choose?
Forecast sensitivity occurs with all statistical models, but is more acute in models that contain a large number of parameters. Each branch in a regression tree is a parameter so with your two predictors the model has five parameters.
A random forest algorithm side steps the question of prediction choice by making predictions from many trees (Breiman 2001).
It creates a sample from the set of all years and grows a tree using data only from the sampled years. It then repeats the sampling and grows another tree.
Each tree gives a prediction and the mean is taken. The function randomForest() in the randomForest package provides a random forest algorithm. For example, type
require(randomForest)
rf = randomForest(H ~ SOI + SST, data = df)
By default the algorithm grows 500 trees. To make a prediction type,
predict(rf, data.frame(SOI = -2, SST = 0.2))
## 1
## 4.91
Regression trees and random forest algorithms tend to over fit your data especially when you are searching over a large set of potential predictors in noisy climate data.
Over fitting results in small in-sample error, but large out-of-sample error. Again, a cross-validation exercise is needed if you want to claim the algorithm has superior predictive skill.
Cross validation removes the noise specific to each year's set of observations and estimates how well the random forest algorithm finds prediction rules when this coincident information is unavailable.
For example, does the random forest algorithm provide better prediction skill than a Poisson regression?
To answer this question you arrange a HOOCV as follows.
n = length(df$H)
rfx = numeric(n)
prx = numeric(n)
for (i in 1:n) {
rfm = randomForest(H ~ SOI + SST, data = df[-i, ])
prm = glm(H ~ SOI + SST, data = df[-i, ], family = "poisson")
new = df[i, ]
rfx[i] = predict(rfm, newdata = new)
prx[i] = predict(prm, newdata = new, type = "response")
}
The out-of-sample mean-squared prediction error is computed by typing
mean((df$H - prx)^2)
## [1] 5.07
mean((df$H - rfx)^2)
## [1] 5.357
Results indicate that the Poisson regression performs slightly better than the random forest algorithm in this case although the difference is not statistically significant.
The correlation between the actual and predicted value is 0.539 for the Poisson model and 0.502 for the random forest algorithm.
With only a few variables you can examine the bivariate influence of the covariates on the response.
Hurricane counts increase with SST and SOI but for high values of SOI the influence of SST is stronger. Similarly for high values of SST the influence of the SOI is more pronounced.
The random forest model is able to capture non-linearities and thresholds, but at the expense of interpreting some noise as signal as seen by the relative high count with SOI values near $-$3 s.d. and SST values near $-\( 0.1 \)\circ$C.
newdat = expand.grid(SST = seq(-0.5, 0.7, 0.01), SOI = seq(-5, 4, 0.1))
z1 = predict(rf, newdata = newdat)
prm = glm(H ~ SOI + SST, data = df, family = "poisson")
z2 = predict(prm, newdata = newdat, type = "response")
newdat$Hrf = z1
newdat$Hpr = z2
require(lattice)
require(colorRamps)
require(grid)
cr = blue2green(20)
p1 = levelplot(Hrf ~ SST + SOI, data = newdat, at = seq(2, 12, 2), scales = list(tck = 0.5,
alternating = 2, cex = 0.7), xlab.top = textGrob("SST [C]", gp = gpar(cex = 0.8)),
ylab.right = textGrob("SOI [s.d.]", gp = gpar(cex = 0.8), rot = 90), xlab = "",
ylab = "", col.regions = cr, colorkey = list(space = "bottom", cex = 0.7),
sub = list("Hurricane count", cex = 0.9, font = 1))
p2 = levelplot(Hpr ~ SST + SOI, data = newdat, at = seq(2, 12, 2), scales = list(tck = 0.5,
alternating = 2, cex = 0.7), xlab.top = textGrob("SST [C]", gp = gpar(cex = 0.8)),
ylab.right = textGrob("SOI [s.d.]", gp = gpar(cex = 0.8), rot = 90), xlab = "",
ylab = "", col.regions = cr, colorkey = list(space = "bottom", cex = 0.7),
sub = list("Hurricane count", cex = 0.9, font = 1))
p1 = update(p1, main = textGrob("a", x = unit(0.05, "npc")))
p2 = update(p2, main = textGrob("b", x = unit(0.05, "npc")))
print(p1, split = c(1, 1, 2, 1), more = TRUE)
print(p2, split = c(2, 1, 2, 1), more = FALSE)
Some of our research in the 1990's focused on the climatology of tropical cyclones from non-tropical origins (Elsner et al. 1996, Kimberlain and Elsner 1998).
We analyzed each North Atlantic hurricane since 1944 to determine whether we could discern middle latitude influences on development. We classified hurricanes into tropical and baroclinic based on primary origin and development mechanisms.
We would like to have a model that predicts a hurricane's group membership based on where the hurricane originated.
Logistic regression is the model of choice when your response variable is dichotomous. The focus is to predict the occurrence of the event. A hurricane forecaster is keen about whether an area of disturbance will develop into a cyclone given the present atmospheric conditions.
Logistic regression is a generalization of the linear regression model (like Poisson regression) where the response variable does not have a normal distribution and the regression structure is linear in the covariates.
Like Poisson regression the model coefficients are determined using the method of maximum likelihood.
The mean of a binary variable is a percentage. The percentage of a particular outcome can be interpreted as a probability so it is denoted as \( \pi \). The logistic regression model specifies how \( \pi \) is related to a set of explanatory variables.
You input the hurricane data by typing
con = "http://www.hurricaneclimate.com/storage/chapter-7/bi.csv"
bh = read.csv(con, header = TRUE)
table(bh$Type)
##
## 0 1 3
## 187 77 73
The type as determined in Elsner et al. (1996) is indicated by the variable Type with 0 indicating tropical-only, 1 indicating baroclinic influences, and 3 indicating baroclinic initiation.
The typing was done subjectively using all the available synoptic information about each hurricane. While the majority of hurricanes form over warm ocean waters of the deep tropics ('tropical-only') some are aided in their formation by interactions with midlatitude jet stream winds ('baroclinically induced').
The stronger, tropical-only hurricanes develop farther south and primarily occur in August and September. The weaker, baroclinically-induced hurricanes occur throughout a longer season.
First combine the baroclinic types into a single group and add this column to the data frame.
bh$tb = as.integer(bh$Type != 0)
table(bh$tb)
##
## 0 1
## 187 150
With this grouping there are 187 tropical and 150 baroclinic hurricanes in the record of 337 cyclones.
Thus you can state that a hurricane drawn at random from this set of cyclones has about a 55% chance of being tropical only. Your interest is to improve on this climatological probability model by adding a covariate.
Here you consider the latitude at which the cyclone first reaches hurricane strength. As an exploratory step you create box plots of the latitudes grouped by hurricane type.
boxplot(bh$FirstLat ~ bh$tb, horizontal = TRUE, notch = TRUE, yaxt = "n", boxwex = 0.4,
xlab = "Latitude of Hurricane Formation (N)")
axis(2, at = c(1, 2), labels = c("Tropical", "Baroclinic"))
Here you make the box plots horizontal because latitude is your explanatory variable, which should be plotted on the horizontal axis.
With the argument notch set to true, notches are drawn on the box sides. The vertical dash inside the box is the median latitude. Notches extend to \( \pm 1.58 \times \) IQR/\( \sqrt{n} \), where IQR is the interquartile range and \( n \) is the sample size.
If the notches of two box plots do not overlap this provides evidence that the two medians are statistically different Chambers et al. (1983).
The median formation latitude for the set of tropical hurricanes is 17.9N and for the set of baroclinic hurricanes is farther north at 29.1N.
This makes physical sense as cyclones farther south are less likely to have influence from middle latitude baroclinic disturbances. The relatively small overlap between the two sets of latitudes strengthens your conviction that a latitude variable will improve a model for hurricane type.
Linear regression is not the appropriate model for binary data. It violates the assumption of equal variance and normality of residuals resulting in invalid standard errors and erroneous hypothesis tests. In its place you use a generalized linear model as you did above with the count data.
However, instead of using the logarithm as the link between the response and the covariates as you did in the Poission regression model, here you use the logit function. The logit of a number \( \pi \) between 0 and 1 is \[ \hbox{logit}(\pi)=\log\left( \frac{\pi}{1-\pi} \right) =\log(\pi)-\log(1-\pi). \] If \( \pi \) is a probability then \( \pi/(1-\pi) \) is the corresponding odds, and the logit of the probability is the logarithm of the odds. Odds are expressed as for:against (read: for to against) something happening. So the odds of a hurricane strike that is posted at 1:4 has a 20% chance of occurring.
The logistic regression model is expressed statistically as \[ \hbox{logit}(\pi) = \beta_0 + \beta_1 x_1 + \ldots +\beta_p x_p, \] where \( \pi \) is the mean. There are \( p \) covariates (\( x_i \)'s) and \( p+1 \) parameters (\( \beta_i \)'s).
To convert logit(\( \pi \)) to \( \pi \) (probability of occurrence) you use the logistic function (inverse of the logit function) given as \[ \hbox{logistic}(\alpha) = \frac{1}{1 + \exp(-\alpha)} = \frac{\exp(\alpha)}{1 + \exp(\alpha)}. \]
To fit a logistic regression model to hurricane type with latitude as the covariate and saving the model as an object, type
lorm = glm(tb ~ FirstLat, data = bh, family = "binomial")
The syntax is similar to the Poisson regression model, but here the family is binomial. The formula is read hurricane type is modeled as a function of formation latitude.'
The model coefficients are determined by the method of maximum likelihood in the glm() function. To produce a table of the coefficients you type
summary(lorm)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.0826 0.96148 -9.446 3.504e-21
## FirstLat 0.3728 0.03947 9.447 3.494e-21
The estimated coefficients are listed by row. The coefficient for the intercept is the log odds of a hurricane at latitude of zero being baroclinic.
In other words, the odds of being baroclinic when the latitude is zero is exp(-9.0826) = 1.14 × 10-4.
These odds are very low, but that makes sense since no hurricanes form at the equator. So the intercept in the model corresponds to the log odds of a baroclinic hurricane when latitude is at the hypothetical equator.
Interest is on the coefficient of the formation latitude variable indicated by the row labeled FirstLat. The value is 0.373. Before fitting the model you anticipate the formation latitude coefficient to have a positive sign.
Why? Because baroclinic (tropical) type hurricanes are coded as 1 (0) in your data set and the box plots show that as formation latitude increases, the chance that a hurricane has baroclinic influences increases.
Note that if your response values are character strings (e.g., to' and
be') rather than coded as 0s and 1s, things will still work, but R will assign 0s and 1s based on alphabetical order and this will affect how you make sense of the coefficient's sign.
The magnitude of the coefficient is interpreted to mean that for every degree increase in formation latitude the log odds increases by a constant 0.373 units, on average.
This is not very informative. By taking the exponent of the coefficient value, the interpretation is in terms of an odds ratio.
exp(summary(lorm)$coefficients[2])
## [1] 1.452
Thus for every degree increase in formation latitude the odds ratio increases on average by a constant factor of 1.45 (or 45%). This increase does not depend on the value of latitude.
That is, logistic regression is linear in the odds ratio. The interpretation is valid only over the range of latidues in your data, and physically meaningless for latitudes outside the range where hurricanes occur.
The table of coefficients includes a standard error and \( p \)-value. Statistical significance is based on a null hypothesis that the coefficient is zero. The ratio of the estimated coefficient to its standard error (\( z \)-value) has an approximate standard normal distribution assuming the null is true.
The probability of finding a \( z \)-value this extreme or more extreme is the \( p \)-value. The smaller the \( p \)-value, the less support there is for the null hypothesis given the data and the model. The lack of support for the null allows us to accept our model.
A confidence interval on the estimated coefficient is obtained by typing
confint(lorm)[2, ]
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.3011 0.4565
This is interpreted to mean that although your best estimate for the log odds of a baroclinic hurricane given latitude is 0.373, there is a 95% chance that the interval between 0.301 and 0.456 will cover the true log odds.
Predictions help you understand your model. To predict the probability that a hurricane picked at random from your data will be baroclinic given that its formation latitude is 20N latitude, you type
predict(lorm, newdata = data.frame(FirstLat = 20), type = "response")
## 1
## 0.1644
Thus the probability of a baroclinic hurricane forming at this low latitude is 16.4% on average.
To create a plot of predictions across a range of latitudes, first prepare a vector of latitudes. The vector spans the latitudes in your data set. You specify an increment of 0.1 degree so the resulting prediction curve is smooth.
Then use the predict() method and set se.fit to true. Save the prediction and the predictions corresponding to $\pm\( 1.96 \)\times$ the standard error.
lats = seq(min(bh$FirstLat), max(bh$FirstLat), 0.1)
probs = predict(lorm, newdata = data.frame(FirstLat = lats), type = "response",
se.fit = TRUE)
pm = probs$fit
pu = probs$fit + probs$se.fit * 1.96
pl = probs$fit - probs$se.fit * 1.96
Finally you plot the data points at 0 and 1 as you did above with the bar plot and add the predicted values using the lines() function.
plot(bh$FirstLat, bh$tb, pch = 19, cex = 0.4, ylab = "Probability", xlab = "Formation Latitude (N)")
grid()
lines(lats, pm, lwd = 2)
lines(lats, pu, lwd = 2, col = "red")
lines(lats, pl, lwd = 2, col = "red")
Tropical-only and baroclinically-enhanced hurricane points are shown along the \( y=0 \) and \( y=1 \) lines, respectively. The gray band is the 95% pointwise confidence interval.
Model predictions make sense. The probability of a baroclinically-enhanced hurricane is less than 20% at latitudes south of 20$\circ\( N. However, by latitude 24 \)\circ\( N, the probability exceeds 50% and by latitude 31 \)\circ$N the probability exceed 90%. The odds ratio is constant but the probability is a nonlinear function of latitude.
Output from a summary method applied to your model object prints statistics of model fit including null and deviance residuals and the AIC. These are shown below the table of coefficients. One measure of model fit is the significance of the overall model.
This test asks whether the model with latitude fits significantly better than a model with just an intercept. An intercept-only model is called a 'null' model (no covariates).
The test statistic is the difference between the residual deviance for the model with latitude and the null model. The test statistic has a \( \chi \)-squared distribution with degrees of freedom equal to the differences in degrees of freedom between the latitude model and the null model (i.e. the number of predictors in the model; here just one).
To find the difference in deviance between the two models (i.e. the test statistic) along with the difference in degrees of freedom, type
dd = lorm$null.deviance - lorm$deviance
ddof = lorm$df.null - lorm$df.residual
dd
## [1] 231.1
ddof
## [1] 1
Then the \( p \)-value as evidence in support of the null model is obtained by typing
1 - pchisq(q = dd, df = ddof)
## [1] 0
This leads you to reject the null hypothesis in favor of the model that includes latitude as a covariate.
A model can fit well but still be inadequate if it is missing an important predictor or if the relationship has a different form. Model adequacy is examined with the residual deviance statistic.
The test is performed under the null hypothesis that the model is adequate. Under this hypothesis, the residual deviance has a \( \chi \)-squared distribution with residual degrees of freedom. Thus to test the model for adequacy you type
pchisq(q = lorm$deviance, df = lorm$df.residual)
## [1] 4.24e-06
The small \( p \)-value indicates the model is not adequate. So while formation latitude is a statistically significant predictor of baroclinic hurricanes, the model can be improved.
To try and improve things you add another variable to the model. Here you create a new model adding the latitude at which maximum intensity first occurred (MaxLat) and examine the table of coefficients.
lorm2 = glm(tb ~ FirstLat + MaxLat, data = bh, family = "binomial")
summary(lorm2)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.5596 0.97700 -8.761 1.933e-18
## FirstLat 0.5044 0.06617 7.622 2.504e-14
## MaxLat -0.1335 0.04817 -2.772 5.566e-03
Although the latitude at maximum intensity is also statistically significant, something is wrong. The sign on the coefficient is negative indicating that baroclinic hurricanes are more likely if maximum latitude occurs farther south. This lacks physical sense and it indicates a serious problem with the model.
The problem arises because of the strong correlation between your explanatory variables. Check the amount of correlation by typing
cor(bh$FirstLat, bh$MaxLat)
## [1] 0.8548
The correlation exceeds 0.6, so it is best to remove one of your variables. You go back to your one-predictor model, but this time you use maximum latitude. You again check the model for statistical significance and adequacy and find both.
lorm3 = glm(tb ~ MaxLat, data = bh, family = "binomial")
summary(lorm3)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9654 0.67600 -8.825 1.099e-18
## MaxLat 0.2072 0.02358 8.784 1.579e-18
pchisq(q = lorm3$deviance, df = lorm3$df.residual)
## [1] 0.5433
Thus you settle on a final model that includes the latitude at maximum intensity as the sole predictor.