\chapter{Frequency Models}
\label{chap:frequencymodels}
%data files: read.table("US.txt", T); load("annual.RData"); read.csv("bi.csv", T)
%packages: xtable, pscl, plotmo, earth, ggplot2, reshape
%source code: NONE
%third party: source("roc.R")
\begin{quote} ``Statistics is the grammar of science.''
\end{quote}
\indent---Karl Pearson\\
Here in Part II of the book we focus on statistical models for understanding and predicting hurricane climate. This chapter shows you how to model hurricane occurrence. This is done using the annual count of hurricanes making landfall in the United States. We also consider the occurrence of hurricanes across the basin and by origin.
We begin with exploratory analysis and then show you how to model counts with Poisson regression. Issues of model fit, interpretation, and prediction are considered in turn. We then examine how to assess forecast skill, including how to perform cross validation. Alternatives to the Poisson regression model are considered, and logistic regression and receiver operating characteristics are also covered.
\section{Counts}
The data set {\it US.txt} contains a list of tropical cyclone counts by year (see Chapter~\ref{chap:Rtutorial}). The counts indicate the number of hurricanes hitting the United States (excluding Hawaii). Input the data, save them as a data frame object, and print out the first six lines by typing
\begin{Schunk}
\begin{Sinput}
> H = read.table("US.txt", header=TRUE)
> head(H)
\end{Sinput}
\begin{Soutput}
Year All MUS G FL E
1 1851 1 1 0 1 0
2 1852 3 1 1 2 0
3 1853 0 0 0 0 0
4 1854 2 1 1 0 1
5 1855 1 1 1 0 0
6 1856 2 1 1 1 0
\end{Soutput}
\end{Schunk}
The columns include year {\tt Year}, number of U.S. hurricanes {\tt All}, number of major U.S. hurricanes {\tt MUS}, number of U.S. Gulf coast hurricanes {\tt G}, number of Florida hurricanes {\tt FL}, and number of East coast hurricanes {\tt E}. To make subsequent analyses easier, save the number of years in the record as \verb@n@ and the average number of hurricanes per year as \verb@rate@.
\begin{Schunk}
\begin{Sinput}
> n = length(H$Year); rate = mean(H$All)
> n; rate
\end{Sinput}
\begin{Soutput}
[1] 160
\end{Soutput}
\begin{Soutput}
[1] 1.69
\end{Soutput}
\end{Schunk}
The average number of U.S. hurricanes is 1.69 per year over these 160 years.
Good practice requires you to show your data. This gives readers a way to examine the modeling assumptions you make. Here you plot the time series and distribution of the annual counts. Together the two plots provide a nice summary of the information in your data relevant to any modeling effort.
\begin{Schunk}
\begin{Sinput}
> par(las=1)
> layout(matrix(c(1, 2), 1, 2, byrow=TRUE),
+ widths=c(3/5, 2/5))
> plot(H$Year, H$All, type="h", xlab="Year",
+ ylab="Hurricane Count")
> grid()
> mtext("a", side=3, line=1, adj=0, cex=1.1)
> barplot(table(H$All), xlab="Hurricane Count",
+ ylab="Number of Years", main="")
> mtext("b", side=3, line=1, adj=0, cex=1.1)
\end{Sinput}
\end{Schunk}
The \verb@layout@ function divides the plot page into rows and columns as specified in the matrix function (first argument). The column widths are specified using the \verb@widths@ argument. The plot symbol is a vertical bar (\verb@type="h"@); there is no need to connect the bars between years with a line. The tick labels on the vertical axis are whole numbers, consistent with count data.
\begin{figure}
\centering
\input{Chap07/Chap07-seriesandhist.tikz}
\vspace{-.7cm}
\caption{Annual hurricane occurrence. (a) Time series and (b) distribution.}
\label{fig:UScounts:timesanddistribution}
\end{figure}
Figure~\ref{fig:UScounts:timesanddistribution} shows the time series and distribution of annual hurricanes over the 160-year period. There is a total of 271 hurricanes. The year-to-year variability and the distribution of counts appear to be consistent with a random count process. There are 34 years without a hurricane and one year (1886) with seven hurricanes. Tabulating the number of years with each hurricane count gives the histogram in panel (b).
\subsection{Poisson process}
The shape of the histogram suggests a Poisson distribution might be a good description of these data. The probability mass function of the Poisson distribution gives the probability $p$ of obtaining a count $x$ when the mean count (rate) is $\lambda$:
\begin{equation}
p(x) = \frac{\mathrm{e}^{-\lambda} \lambda^x}{x!},
\label{eq:poisdist}
\end{equation}
where $\mathrm{e}$ is the base of the natural logarithm and $!$ is the factorial symbol. The equation indicates that the probability of no events is $p(0) = \mathrm{e}^{-\lambda}$.
With $\lambda=$1.69 hurricanes per year, the probability of no hurricanes in a random year is
\begin{Schunk}
\begin{Sinput}
> exp(-rate)
\end{Sinput}
\begin{Soutput}
[1] 0.184
\end{Soutput}
\end{Schunk}
This implies that the probability of at least one hurricane is 0.82 or 82~\%.
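You can check this complement directly by subtracting the probability of no hurricanes from one.
\begin{Schunk}
\begin{Sinput}
> 1 - exp(-rate)
\end{Sinput}
\begin{Soutput}
[1] 0.816
\end{Soutput}
\end{Schunk}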
Using the \verb@dpois@ function you can determine the probability for any number of hurricanes. For example, to determine the probability of observing exactly one hurricane when the rate is 1.69 hurricanes per year, type
\begin{Schunk}
\begin{Sinput}
> dpois(x=1, lambda=rate)
\end{Sinput}
\begin{Soutput}
[1] 0.311
\end{Soutput}
\end{Schunk}
Or the probability of five hurricanes expressed in percent is
\begin{Schunk}
\begin{Sinput}
> dpois(5, rate) * 100
\end{Sinput}
\begin{Soutput}
[1] 2.14
\end{Soutput}
\end{Schunk}
Recall you can leave off the argument names in the function if the argument values are placed in the default order. The default order can be found by placing a question mark in front of the function name and leaving off the parentheses; this brings up the function's help page.
To answer the question, {\it what is the probability of two or fewer hurricanes?}, you use the cumulative probability function \verb@ppois@ as follows
\begin{Schunk}
\begin{Sinput}
> ppois(q=2, lambda=rate)
\end{Sinput}
\begin{Soutput}
[1] 0.759
\end{Soutput}
\end{Schunk}
Then to answer the question, {\it what is the probability of more than two hurricanes?}, you add the argument \verb@lower.tail=FALSE@.
\begin{Schunk}
\begin{Sinput}
> ppois(q=2, lambda=rate, lower.tail=FALSE)
\end{Sinput}
\begin{Soutput}
[1] 0.241
\end{Soutput}
\end{Schunk}
\subsection{Inhomogeneous Poisson process}
The Poisson distribution has the property that the variance is equal to the mean. Thus the ratio of the variance to the mean is one. You compute this ratio from your data by typing
\begin{Schunk}
\begin{Sinput}
> round(var(H$All)/rate, 2)
\end{Sinput}
\begin{Soutput}
[1] 1.24
\end{Soutput}
\end{Schunk}
This says that the variance of hurricane counts is 24~\% larger than the mean. Is this unusual for a Poisson distribution?
You check by performing a Monte Carlo (MC) simulation experiment. An MC simulation relies on repeated random sampling. For example, a single random sample of five counts from a Poisson distribution with a rate of 1.5 is obtained by typing
\begin{Schunk}
\begin{Sinput}
> rpois(n=5, lambda=1.5)
\end{Sinput}
\begin{Soutput}
[1] 0 1 0 2 1
\end{Soutput}
\end{Schunk}
Here you repeat this $m=1000$ times. Let $n$ be the number of years in your hurricane record and $\lambda$ be the rate. For each sample you compute the ratio of the variance to the mean.
\begin{Schunk}
\begin{Sinput}
> set.seed(3042)
> ratio = numeric()
> m = 1000
> for (i in 1:m){
+ h = rpois(n=n, lambda=rate)
+ ratio[i] = var(h)/mean(h)
+ }
\end{Sinput}
\end{Schunk}
The vector {\tt ratio} contains 1000 values of the ratio. To help answer the {\it is this unusual?} question, you determine the proportion of ratios greater than 1.24 by typing
\begin{Schunk}
\begin{Sinput}
> sum(ratio > var(H$All)/rate)/m
\end{Sinput}
\begin{Soutput}
[1] 0.028
\end{Soutput}
\end{Schunk}
Only 2.8~\% of the ratios are larger, so the answer from your MC experiment is `yes,' the variability in hurricane counts is higher than you would expect (unusual) from a Poisson distribution with a constant rate.
This indicates that the rate varies over time. Although you can compute a long-term average, some years have a higher rate than others. The variation in the rate is due to things like El Ni\~no. So you expect more variance (extra dispersion) in counts relative to a constant rate (homogeneous Poisson) distribution. This is the basis behind seasonal forecasts. Note that a variation in the annual rate is not obvious from looking at the variation in counts. Even with a constant rate the counts will vary.
You modify your MC simulation using the gamma distribution for the rate and then examine the ratio of variance to the mean from a set of Poisson counts with the variable rate. The gamma distribution describes the variability in the rate using the shape and scale parameters. The mean of the gamma distribution is the shape times the scale. You specify the shape to be 5.6 and the scale to be 0.3 so the product matches closely the long-term average count. You could, of course, choose other values that produce the same average.
In each of the 1000 simulations you generate 160 years of hurricane counts, drawing a new gamma-distributed rate for each year.
\begin{Schunk}
\begin{Sinput}
> ratio = numeric(); set.seed(3042); m = 1000
> for (i in 1:m){
+ h = rpois(n=n, lambda=rgamma(m, shape=5.6,
+ scale=.3))
+ ratio[i] = var(h)/mean(h)
+ }
> sum(ratio > var(H$All)/rate)/m
\end{Sinput}
\begin{Soutput}
[1] 0.616
\end{Soutput}
\end{Schunk}
In this case 61.6~\% of the ratios are larger, so you conclude that the observed hurricane counts are consistent with a variable rate (inhomogeneous) Poisson model.
The examples above demonstrate an important use of statistics; to simulate data that have the same characteristics as your observations. Figure~\ref{fig:MCcounts} shows a plot of the observed hurricane record over the 160-year period together with plots from three simulated records of the same length and having the same over-dispersed Poisson variation as the observed record. As shown above, such simulated records provide a way to test hypotheses about natural variability in hurricane climate.
\begin{figure}
\centering
\input{Chap07/Chap07-observedvssimulatedcounts.tikz}
\vspace{-.25cm}
\caption{Hurricane occurrence using (a) observed and (b--d) simulated counts.}
\label{fig:MCcounts}
\end{figure}
Summary characteristics of 100 years of hurricanes at a coastal location may be of little value on their own, but running a sediment transport model at that location with a large number of simulated hurricane counts provides an assessment of the uncertainty in sediment movement caused by natural variation in hurricane frequency.
\section{Environmental Variables}
The parameter of interest is the annual rate. Given the rate, you have a probability distribution for any possible hurricane count. You noted the observed counts are consistent with a Poisson model having a variable rate. But where does this variability come from?
On the annual time scale, to first order, high ocean heat content and cold upper-air temperature provide the fuel for a hurricane, a calm atmosphere allows a hurricane to intensify, and the position and strength of the subtropical high pressure steer any hurricane that does form. Thus, hurricane activity responds to changes in climate conditions that affect these factors, including sea-surface temperature (SST) as an indicator of oceanic heat content, sunspot number as an indicator of upper-air temperature, El Ni\~no-Southern Oscillation (ENSO) as an indicator of wind shear, and the North Atlantic Oscillation (NAO) as an indicator of steering flow.
SSTs provide an indication of the thermodynamic environment, as do sunspots. An increase in solar UV radiation during periods of strong solar activity has a suppressing effect on tropical cyclone intensity, as the air above the hurricane warms through absorption of radiation by ozone, modulated by dynamic effects in the stratosphere \citep{ElsnerJagger2008}. ENSO is characterized by basin-scale fluctuations in sea-level pressure across the equatorial Pacific Ocean. The Southern Oscillation Index (SOI) is defined as the normalized sea-level pressure difference between Tahiti and Darwin; values are available back through the mid-19th century. The SOI is strongly anti-correlated with equatorial Pacific SSTs, so an El Ni\~no warming event is associated with negative SOI values.
The NAO is characterized by fluctuations in sea-level pressure (SLP) differences. Monthly values are an indicator of the strength and/or position of the subtropical Bermuda High. The relationship might result from a teleconnection between the midlatitudes and tropics whereby a below normal NAO during the spring leads to dry conditions over the continents and to a tendency for greater summer/fall middle tropospheric ridging (enhancing the dry conditions). Ridging over the eastern and western sides of the North Atlantic basin tends to keep the middle tropospheric trough, responsible for hurricane recurvature, farther to the north during the peak of the season \citep{ElsnerJagger2006}. The data sets containing the environmental variables are described in Chapter~\ref{chap:datasets}, where you also plotted them using four panels (Fig.~\ref{fig:climatevariables}). With the exception of the NAO index the monthly values are averages over the three months of August through October. The NAO index is averaged over the two months of May and June.
\section{Bivariate Relationships}
Consider the relationship between hurricane frequency and a single environmental variable (covariate). Scatter plots are not very useful as many different covariate values exist for a given count. Instead it is better to plot a summary of the covariate distribution for each count.
The five-number summary provides information about the median, the range, and the quartile values of a distribution (see Chapter~\ref{chap:graphsandmaps}). So, for example, you compare the five-number summary of the NAO during years with no hurricanes against years with three hurricanes by typing
\begin{Schunk}
\begin{Sinput}
> load("annual.RData")
> nao0 = annual$nao[H$All == 0]
> nao3 = annual$nao[H$All == 3]
> fivenum(nao0); fivenum(nao3)
\end{Sinput}
\begin{Soutput}
[1] -2.030 -0.765 -0.165 0.600 1.725
\end{Soutput}
\begin{Soutput}
[1] -2.655 -1.315 -0.685 -0.085 1.210
\end{Soutput}
\end{Schunk}
For each quantile of the five-number summary, the NAO value is lower when there are three hurricanes than when there are none. A plot showing the five-number summary values for all years and all covariates is shown in Fig.~\ref{fig:bivariate}. Note that the covariate is by convention plotted on the horizontal axis in a scatter plot, so you make the lines horizontal.
\begin{figure}
\centering
\input{Chap07/Chap07-bivariateplot.tikz}
\vspace{0cm}
\caption{Bivariate relationships between covariates and hurricane counts.}
\label{fig:bivariate}
\end{figure}
The plots show the SOI and NAO are likely important in statistically explaining hurricane counts as there appears to be a fairly systematic variation in counts across their range of values. The variation is less clear with SST and sunspot number. Note that the bivariate relationships do not necessarily tell the entire story. Covariates may contribute redundant information to the response variable so they all might be significant in a multivariate model (see Chapter~\ref{chap:classicalstatistics}).
\section{Poisson Regression}
The model of choice for count data is Poisson regression. Poisson regression assumes the response variable has a Poisson distribution and the logarithm of the expected value of the response variable is modeled with a linear combination of explanatory variables. It is an example of a log-linear model.
\subsection{Limitation of linear regression}
The linear regression model described in Chapter~\ref{chap:classicalstatistics} is not appropriate for count data. To illustrate, here you regress U.S. hurricane counts on the four explanatory variables (covariates) described above. You then use the model to make predictions specifying the SOI and NAO at three standard deviation departures from the average, a large sunspot number, and an average SST value.
To make things a bit simpler, you first create a data frame object by typing
\begin{Schunk}
\begin{Sinput}
> df = data.frame(All=H$All, SOI=annual$soi,
+ NAO=annual$nao, SST=annual$sst, SSN=annual$ssn)
> df = df[-(1:15), ]
\end{Sinput}
\end{Schunk}
Here the data frame object {\tt df} has columns with labels {\tt All}, {\tt SOI}, {\tt NAO}, {\tt SST}, and {\tt SSN} corresponding to the response variable U.S. hurricane counts and the four explanatory variables. You remove the first fifteen years because of missing SOI values.
You then create a linear regression model object using the \verb@lm@ function specifying the response and covariates accordingly.
\begin{Schunk}
\begin{Sinput}
> lrm = lm(All ~ SOI + NAO + SST + SSN, data=df)
\end{Sinput}
\end{Schunk}
Your model is saved in \verb@lrm@.
Next you use the \verb@predict@ method on the model object together with specific explanatory values specified using the \verb@newdata@ argument. The names must match those used in your model object and each explanatory variable must have a value.
\begin{Schunk}
\begin{Sinput}
> predict(lrm, newdata=data.frame(SOI=-3, NAO=3,
+ SST=0, SSN=250))
\end{Sinput}
\begin{Soutput}
1
-0.318
\end{Soutput}
\end{Schunk}
The prediction results in a negative number that is not a count. It indicates that the climate conditions are unfavorable for hurricanes, but the number has no physical meaning. This is a problem.
\subsection{Poisson regression equation}
A Poisson regression model that specifies the logarithm of the annual hurricane rate is an alternative. The assumption is that the hurricanes are independent in the sense that the arrival of one hurricane will not make another one more or less likely, but the rate of hurricanes varies from year to year due to the covariates.
The Poisson regression model is expressed as
\begin{equation}
\log(\lambda) = \beta_0+\beta_1 x_1 + \ldots +\beta_p x_p + \varepsilon.
\label{eq:poissionregression}
\end{equation}
Here there are $p$ covariates (indicated by the $x_i$'s) and $p+1$ parameters ($\beta_i$'s). The vector $\varepsilon$ is a set of independent and identically distributed residuals. The assumption of independence can be checked by examining whether there is temporal correlation or other patterns in the residual values. The model uses the logarithm of the rate as the response variable, but it is linear in the regression structure. It is not the same as a linear regression on the logarithm of counts. The model coefficients are determined by the {\it method of maximum likelihood}.
It is important to understand that with a Poisson regression you cannot explain all the variation in the observed counts; there will always be unexplainable variation due to the stochastic nature of the process. Thus even if the model precisely predicts the rate of hurricanes, the set of predicted counts will have a degree of variability that cannot be reduced by the model (aleatory uncertainty). Conversely, if you think you can explain most of the variability in the counts, then Poisson regression is not the appropriate model.
\subsection{Method of maximum likelihood}
Given the set of parameters as a vector $\beta$ and a vector of explanatory variables $x$, the mean of the predicted Poisson distribution is given by
\begin{equation}
E(Y|x) = \mathrm{e}^{\beta' x}\,
\end{equation}
and thus, the Poisson distribution's probability density function is given by
\begin{equation}
p(y|x;\beta) = \frac{\mathrm{e}^{y (\beta' x)} \mathrm{e}^{-\mathrm{e}^{\beta' x}}}{y!}
\end{equation}
Suppose you are given a data set consisting of $m$ vectors $x_i \in \mathbb{R}^{n+1}, \, i = 1, \ldots ,m$, along with a set of $m$ values $y_1,\ldots,y_m \in \mathbb{R}$. Then, for a given set of parameters $\beta$, the probability of attaining this particular set of data is given by
\begin{equation}
p(y_1,...,y_m|x_1,\dots,x_m;\beta) = \prod_{i=1}^m \frac{\mathrm{e}^{y_i (\beta' x_i)} \mathrm{e}^{-\mathrm{e}^{\beta' x_i}}}{y_i!}.
\end{equation}
By the method of maximum likelihood, you wish to find the set of parameters $\beta$ that makes this probability as large as possible. To do this, the equation is first rewritten as a likelihood function in terms of $\beta$.
\begin{equation}
L(\beta | X,Y) = \prod_{i=1}^m \frac{\mathrm{e}^{y_i (\beta' x_i)} \mathrm{e}^{-\mathrm{e}^{\beta' x_i}}}{y_i!}.
\end{equation}
Note that the expression on the right-hand side of the equation has not changed. By taking logarithms the equation is easier to work with. The log-likelihood equation is given by
\begin{equation}
\ell(\beta | X,Y) = \log L(\beta|X,Y) = \sum_{i=1}^m \left( y_i (\beta' x_i) - \mathrm{e}^{\beta' x_i} - \log(y_i!)\right).
\end{equation}
Notice that the $\beta$'s only appear in the first two terms of each summation. Therefore, given that you are only interested in finding the best value for $\beta$, you can drop the $y_i !$ and write
\begin{equation}
\ell(\beta | X,Y) = \sum_{i=1}^m \left( y_i (\beta' x_i) - \mathrm{e}^{\beta' x_i}\right).
\end{equation}
This equation has no closed-form solution. However, the negative log-likelihood, $-\ell(\beta | X,Y)$, is a convex function, and so standard convex optimization or gradient descent techniques can be applied to find the optimal value of $\beta$ for which the probability is maximum.
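As an illustration, you can recover these estimates yourself by minimizing the negative log-likelihood with the general-purpose optimizer \verb@optim@. The function name \verb@negLL@ below is illustrative, and the intercept column is added to the covariate matrix by hand; this is a sketch, not the method used by \verb@glm@ internally.
\begin{Schunk}
\begin{Sinput}
> negLL = function(beta, X, y){
+ eta = X %*% beta
+ -sum(y * eta - exp(eta))
+ }
> X = cbind(1, df$SOI, df$NAO, df$SST, df$SSN)
> fit = optim(rep(0, 5), negLL, X=X, y=df$All,
+ method="BFGS")
\end{Sinput}
\end{Schunk}
The vector \verb@fit$par@ closely matches the coefficients returned by the \verb@glm@ fit in the next subsection.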
\subsection{Model fit}
The method of maximum likelihood is employed in the \verb@glm@ function to determine the model coefficients. The Poisson regression model is a type of generalized linear model (GLM) in which the logarithm of the rate of occurrence is a linear function of the covariates (predictors).
To fit a Poisson regression model to U.S. hurricanes and save the model as an object, type
\begin{Schunk}
\begin{Sinput}
> prm = glm(All ~ SOI + NAO + SST + SSN, data=df,
+ family="poisson")
\end{Sinput}
\end{Schunk}
The model formula is identical to what you used to fit the linear regression model above. The formula is read `U.S. hurricane counts are modeled as a function of SOI, NAO, SST, and SSN.' Differences from the linear model fitting include the use of the \verb@glm@ function and the argument specifying \verb@family="poisson"@.
You examine the model coefficients by typing
\begin{Schunk}
\begin{Sinput}
> summary(prm)
\end{Sinput}
\end{Schunk}
% latex table generated in R 2.14.1 by xtable 1.7-0 package
% Sat Mar 17 07:04:03 2012
\begin{table}[ht]
\begin{center}
\caption{Coefficients of the Poisson regression model.}
\label{tab:modelcoef}
\begin{tabular}{rrrrr}
\hline
& Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
\hline
(Intercept) & 0.5953 & 0.1033 & 5.76 & 0.0000 \\
SOI & 0.0619 & 0.0213 & 2.90 & 0.0037 \\
NAO & $-$0.1666 & 0.0644 & $-$2.59 & 0.0097 \\
SST & 0.2290 & 0.2553 & 0.90 & 0.3698 \\
SSN & $-$0.0023 & 0.0014 & $-$1.68 & 0.0928 \\
\hline
\end{tabular}
\end{center}
\end{table}
The model coefficients and the associated statistics are shown in Table~\ref{tab:modelcoef}.
As anticipated from the bivariate relationships, the SOI and SST variables are positively related to the rate of U.S. hurricanes and the NAO and sunspot number are negatively related. You can see that the coefficient on SST is positive, but not statistically significant. Both the SOI and NAO have coefficients that provide convincing evidence against the null hypothesis, while the coefficient on the SSN provides suggestive, but inconclusive evidence against the null.
Statistical significance is based on a null hypothesis that the coefficient value is zero (Chapter~\ref{chap:classicalstatistics}). The ratio of the estimated coefficient to its standard error ($z$-value) has an approximate standard normal distribution assuming the null is true. The probability of finding a $z$-value this extreme or more so is your $p$-value. The smaller the $p$-value, the less support there is for the null hypothesis given your data and model.
You use the \verb@plotmo@ function from the {\bf plotmo} package \citep{Milborrow2012} to plot your model's response when varying one covariate while holding the other covariates constant at their median values.
\begin{Schunk}
\begin{Sinput}
> require(plotmo)
> plotmo(prm)
\end{Sinput}
\end{Schunk}
The results are shown in Fig.~\ref{fig:partialdependence}. As SOI and SST increase so does the hurricane rate, but the rate decreases with increasing NAO and SSN. The curvature arises from the fact that the covariates are related to the counts through the logarithm of the rate.
\begin{figure}
\centering
\input{Chap07/Chap07-partialdependence.tikz}
\vspace{-.5cm}
\caption{Dependence of hurricane rate on covariates in a Poisson regression.}
\label{fig:partialdependence}
\end{figure}
The confidence bands are based on pointwise $\pm$2 standard errors. Note the relatively large width for SOI values above 5 s.d. and for NAO values below $-$2 s.d.
\subsection{Interpretation}
Interpretation of the Poisson regression coefficients is different than the interpretation of the linear regression coefficients explained in Chapter~\ref{chap:classicalstatistics}. For example, the coefficient value on the SOI indicates that for every one standard deviation (s.d.) increase in the SOI, the difference in the logarithm of hurricane rates is 0.062. Since there are other covariates, you must add `given that the other covariates in the model are held constant.'
But what does the difference in the logarithm mean? Note that
\begin{equation}
\log \mathrm{A} - \log \mathrm{B} = \log(\frac{\mathrm{A}}{\mathrm{B}})
\end{equation}
so exponentiating the SOI coefficient value provides a ratio of the hurricane rates for a unit change in the SOI. You do this by typing
\begin{Schunk}
\begin{Sinput}
> exp(summary(prm)$coefficients[2, 1])
\end{Sinput}
\begin{Soutput}
[1] 1.06
\end{Soutput}
\end{Schunk}
and find that for every one s.d. increase in SOI, the hurricane rate increases by a factor of 1.06, or 6~\%. Similarly, since the NAO coefficient is $-0.167$, for every one s.d. increase in the NAO the hurricane rate changes by a factor of $\exp(-0.167) = 0.85$; that is, it decreases by about 15~\%.
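Exponentiating all the coefficients at once gives the full set of rate ratios.
\begin{Schunk}
\begin{Sinput}
> exp(coef(prm))
\end{Sinput}
\end{Schunk}
Values above one indicate covariates that raise the hurricane rate per s.d. increase; values below one indicate covariates that lower it.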
\section{Model Predictions}
Given the model coefficients obtained by the method of maximum likelihood using the \verb@glm@ function and saved as a model object, you make predictions using the \verb@predict@ method. For comparison with the linear regression above, here you predict the rate of hurricanes using the same covariate values.
\begin{Schunk}
\begin{Sinput}
> predict(prm, newdata=data.frame(SOI=-3, NAO=3,
+ SST=0, SSN=250), type="response")
\end{Sinput}
\begin{Soutput}
1
0.513
\end{Soutput}
\end{Schunk}
The argument \verb@type="response"@ gives the prediction in terms of the mean response (hurricane rate). The default, \verb@type="link"@, returns the prediction in terms of the link function (here the logarithm of the mean response). Recall the linear regression model gave a prediction that was physically unrealistic. Here the predicted value indicates a small hurricane rate, as you would expect given the covariate values, but the rate is a realistic non-negative number.
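You can confirm the relationship between the two scales: exponentiating the link-scale prediction recovers the rate.
\begin{Schunk}
\begin{Sinput}
> p = predict(prm, newdata=data.frame(SOI=-3, NAO=3,
+ SST=0, SSN=250), type="link")
> exp(p)
\end{Sinput}
\end{Schunk}
The result matches the 0.513 obtained with \verb@type="response"@.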
The predicted rate together with Eq.~\ref{eq:poisdist} provides a probability for each possible count. To see this you create two bar plots, one for a forecast of hurricanes under unfavorable conditions and another for a forecast of hurricanes under favorable conditions. First you save the predicted rate for values of the covariates that are favorable and unfavorable for hurricanes. You then create a vector of counts from zero to six that is used as the set of quantiles for the \verb@dpois@ function and as the names argument in the \verb@barplot@ function. The plotting parameters are set using the \verb@par@ function. To make it easier to compare the probability distributions, limits on the vertical axis (\verb@ylim@) are set the same.
\begin{Schunk}
\begin{Sinput}
> fav = predict(prm, newdata=data.frame(SOI=2, NAO=-2,
+ SST=0, SSN=50), type="response")
> ufa = predict(prm, newdata=data.frame(SOI=-2, NAO=2,
+ SST=0, SSN=200), type="response")
> h = 0:6
> par(mfrow=c(1, 2), las=1)
> barplot(dpois(x=h, lambda=ufa), ylim=c(0, .5),
+ names.arg=h, xlab="Number of Hurricanes",
+ ylab="Probability")
> mtext("a", side=3, line=1, adj=0, cex=1.1)
> barplot(dpois(x=h,lambda=fav), ylim=c(0,.5),
+ names.arg=h, xlab="Number of Hurricanes",
+ ylab="Probability")
> mtext("b", side=3, line=1, adj=0, cex=1.1)
\end{Sinput}
\end{Schunk}
\begin{figure}
\centering
\input{Chap07/Chap07-barplotspredictions.tikz}
\vspace{-.5cm}
\caption{Forecast probabilities for (a) unfavorable and (b) favorable conditions.}
\label{fig:poispredict}
\end{figure}
The result is shown in Fig.~\ref{fig:poispredict}. The forecast probability of two or more hurricanes is 72~\% in years with favorable conditions, but decreases to 16~\% in years with unfavorable conditions. The probability of no hurricanes during years with favorable conditions is 8~\%, which compares with a probability of 48~\% during years with unfavorable conditions. Using the data, the model translates swings in climate to landfall probabilities.
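The quoted probabilities follow directly from the predicted rates. For example, the probability of two or more hurricanes is the upper tail of the Poisson distribution above a count of one, and the probability of no hurricanes is the mass at zero.
\begin{Schunk}
\begin{Sinput}
> ppois(1, lambda=fav, lower.tail=FALSE)
> ppois(1, lambda=ufa, lower.tail=FALSE)
> dpois(0, lambda=fav); dpois(0, lambda=ufa)
\end{Sinput}
\end{Schunk}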
\section{Forecast Skill}
\subsection{Metrics}
Forecast skill refers to how well your predictions match the observations. There are several ways to quantify this match. Here you consider three of the most common, the mean absolute error ($\mathrm{MAE}$), the mean squared error ($\mathrm{MSE}$) and the correlation coefficient ($\mathrm{r}$).
Let $\lambda_i$ be the predicted rate for year $i$ and $o_i$ be the corresponding observed count for that year. Then the three measures of skill over $n$ years are defined by
\begin{eqnarray}
\mathrm{MAE} &=& \frac{1}{n}\sum_{i=1}^n \left| \lambda_i-o_i\right| \\
\mathrm{MSE} &=& \frac{1}{n}\sum_{i=1}^n \left( \lambda_i-o_i\right)^2 \\
\mathrm{r} &=& \frac{\sum (\lambda_i - \bar \lambda) (o_i - \bar o)}{\sqrt{\sum (\lambda_i - \bar \lambda)^2 \sum (o_i - \bar o)^2}}
\end{eqnarray}
You obtain the predicted rate for each year in the record ($\lambda_i$) by typing
\begin{Schunk}
\begin{Sinput}
> prm = glm(All ~ SOI + NAO + SSN, data=df,
+ family="poisson")
> pr = predict(prm, type="response")
\end{Sinput}
\end{Schunk}
You first refit the model, removing the statistically insignificant SST covariate. Since each prediction is made for a year with a known hurricane count, it is referred to as a `hindcast.' The word `forecast' is reserved for a prediction made for a year in which the hurricane count is unknown (typically in the future).
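With the hindcast rates in \verb@pr@ and the observed counts, the first three metrics are computed directly.
\begin{Schunk}
\begin{Sinput}
> o = df$All
> mean(abs(pr - o))
> mean((pr - o)^2)
> cor(pr, o)
\end{Sinput}
\end{Schunk}
The resulting values appear in the Poisson column of Table~\ref{tab:skillmetricsin}.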
The Poisson regression hindcasts are given in terms of rates while the observations are counts. So instead of comparing the rate directly with the count in the first two formulae above, you weight each possible count $j = 0, 1, \ldots, \infty$ by its probability. The probabilistic forms of the formulae are
\begin{eqnarray}
\mathrm{MAEp} &=& \frac{1}{n} \sum_{i=1}^{n} \sum_{j=0}^{\infty} \mathrm{dpois}(j,\lambda_i) \cdot \left|j-o_i\right| \\
\mathrm{MSEp} &=& \frac{1}{n} \sum_{i=1}^{n} \sum_{j=0}^{\infty} \mathrm{dpois}(j,\lambda_i) \cdot \left(j-o_i\right)^2 \\
&=& \mathrm{MSE} + \bar \lambda
\end{eqnarray}
where $\mathrm{dpois}(j, \lambda_i)$ is the probability of $j$ hurricanes in the $i$th year given the predicted rate $\lambda_i$, and $o_i$ is the observed count in the $i$th year. Note there is no corresponding probabilistic form of the correlation as a measure of forecast skill.
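In practice the infinite sums converge quickly, so truncating at a count of, say, 20 is adequate. A sketch of the computation:
\begin{Schunk}
\begin{Sinput}
> J = 0:20
> MAEp = mean(sapply(seq_along(pr), function(i)
+ sum(dpois(J, pr[i]) * abs(J - df$All[i]))))
> MSEp = mean((pr - df$All)^2) + mean(pr)
\end{Sinput}
\end{Schunk}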
A prediction model is deemed useful if its skill exceeds that of a naive reference model. The percentage improvement over the naive model is referred to as the useful skill. The naive model is typically climatology. To obtain the climatological rate you type
\begin{Schunk}
\begin{Sinput}
> clim = glm(All ~ 1, data=df, family="poisson")
> cr = predict(clim, type="response")
\end{Sinput}
\end{Schunk}
Note that the only difference from your model above is that the term to the right of the tilde is a 1. The model predicts a single value representing the mean hurricane rate over the period of record. The value is the same for each year.
Table~\ref{tab:skillmetricsin} shows skill metrics for your U.S. hurricane model and the percentage of useful skill relative to climatology. Note that the correlation is undefined for the climatological forecast.
% latex table generated in R 2.14.1 by xtable 1.7-0 package
% Sat Mar 17 07:04:16 2012
\begin{table}[ht]
\begin{center}
\caption{Forecast skill (in sample).}
\label{tab:skillmetricsin}
\begin{tabular}{rrrr}
\hline
& Poisson & Climatology & Useful \\
\hline
MAE & 1.08 & 1.16 & 7.04 \\
MSE & 1.93 & 2.18 & 11.24 \\
r & 0.34 & & \\
MAEp & 1.44 & 1.50 & 4.08 \\
MSEp & 3.67 & 3.92 & 6.26 \\
\hline
\end{tabular}
\end{center}
\end{table}The useful skill level is between 4.1 and 11.2~\% relative to climatology. While not high, it represents a significant improvement.
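For example, the MAE rows of the table come from the hindcast (\verb@pr@) and climatological (\verb@cr@) vectors created above, with the useful skill being the percentage reduction relative to climatology:
\begin{Schunk}
\begin{Sinput}
> mae = mean(abs(df$All - pr))
> maec = mean(abs(df$All - cr))
> (maec - mae) / maec * 100
\end{Sinput}
\end{Schunk}
The MSE rows are computed the same way with squared differences in place of absolute differences.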
\subsection{Cross validation}
The above procedure results in an in-sample assessment of forecast skill. All years of data are used to estimate a single set of model coefficients with the model subsequently used to hindcast each year's hurricane activity. But how well will your model perform when making predictions of the future? This question is best answered with an out-of-sample assessment of skill. An out-of-sample assessment (1) excludes a single year of observations, (2) determines the MLE coefficients of the Poisson regression model using observations from the remaining years, and (3) uses the model to predict the hurricane count for the year left out. This is done $n$ times, removing each year's worth of observations successively. The above skill metrics are then used on these out-of-sample predictions. The procedure is called `cross validation' and where single cases are left out, it is called leave-one-out cross validation (LOOCV).
To perform LOOCV on your Poisson regression model you loop over all years using the index \verb@i@. Within the loop you determine the model using all years except \verb@i@ (\verb@df[-i, ]@ in the \verb@data@ argument). You then make a prediction only for the year you left out (\verb@newdata=df[i, ]@). Note that your climatology model is cross validated as well.
\begin{Schunk}
\begin{Sinput}
> j = 0:15; n = length(df$All)
> prx = numeric()
> crx = numeric()
> for(i in 1:n){
+ prm = glm(All ~ SOI + NAO + SSN,
+ data=df[-i, ], family="poisson")
+ clm = glm(All ~ 1, data=df[-i, ], family="poisson")
+ prx[i] = predict(prm, newdata=df[i, ], type="r")
+ crx[i] = predict(clm, newdata=df[i, ], type="r")
+ }
\end{Sinput}
\end{Schunk}
Skill assessment is done in the same way as for the in-sample assessment. The results of the cross-validation assessment of model skill are given in Table~\ref{tab:skillmetricsout}.
% latex table generated in R 2.14.1 by xtable 1.7-0 package
% Sat Mar 17 07:04:18 2012
\begin{table}[ht]
\begin{center}
\caption{Forecast skill (out of sample).}
\label{tab:skillmetricsout}
\begin{tabular}{rrrr}
\hline
& Poisson & Climatology & Useful \\
\hline
MAE & 1.11 & 1.16 & 4.91 \\
MSE & 2.05 & 2.21 & 7.03 \\
r & 0.26 & & \\
MAEp & 1.46 & 1.51 & 2.81 \\
MSEp & 3.80 & 3.95 & 3.78 \\
\hline
\end{tabular}
\end{center}
\end{table}Out-of-sample skill levels are lower. However, this is an estimate of the average skill the model will have when used to make actual forecasts. Always show out-of-sample skill if you intend to use your model to predict the future.
The difference in useful skill percentage between the in-sample and out-of-sample assessments is a measure of the overfitting in your model. Overfitting arises when your model interprets random fluctuations as signal. Cross validation helps protect you against being fooled by this type of randomness.
\section{Nonlinear Regression Structure}
Poisson regression specifies a linear relationship between your covariates and the logarithm of the hurricane rate. Linearity in the regression structure can be restrictive if the influence of a covariate changes over its range. Multivariate adaptive regression splines (MARS) is a form of regression introduced by \cite{Friedman1991} that allows for such nonlinearities.
MARS builds models of the form
\begin{equation}
\hat f(x) = \sum_{i=1}^k c_i B_i(x)
\end{equation}
where $B_i(x)$ is a basis function and $c_i$ is a constant coefficient. The model is thus a weighted sum of the $k$ basis functions. A basis function takes one of three forms: a constant representing the intercept term; a {\it hinge} function of the form max($0, x - a$), where the constant $a$ is the {\it knot} of the hinge function; or a product of two or more hinge functions, which allows the basis function to capture interactions between covariates. A hinge function is zero over part of its range, so it partitions your multivariate data into disjoint regions.
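A hinge function is easy to write in R. This sketch evaluates one with a knot at $a = 0.5$ (the names \verb@h@ and \verb@a@ are illustrative only):
\begin{Schunk}
\begin{Sinput}
> h = function(x, a) pmax(0, x - a)
> x = seq(-2, 2, by=.5)
> h(x, a=.5)
\end{Sinput}
\end{Schunk}
The function returns zero for all values of \verb@x@ at or below the knot and increases linearly above it.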
The \verb@earth@ function in the {\bf earth} package \citep{Milborrow2011} provides functionality for MARS. The syntax is similar to other models in R. Here you create a model using MARS for your hurricane counts and the same environmental covariates by typing
\begin{Schunk}
\begin{Sinput}
> require(earth)
> mars = earth(All ~ SOI + NAO + SST + SSN, data=df,
+ glm=list(family="poisson"))
\end{Sinput}
\end{Schunk}
A summary method on the model object indicates that only the SOI and NAO are selected as important in explaining hurricane rates. The correlation between the predicted rate and the observed counts is obtained by taking the square root of the R-squared value.
\begin{Schunk}
\begin{Sinput}
> sqrt(mars$rsq)
\end{Sinput}
\begin{Soutput}
[1] 0.469
\end{Soutput}
\end{Schunk}
This value exceeds the correlation from your Poisson regression by 39.7~\% suggesting you might have found a better prediction model.
The partial dependence plot (Fig.~\ref{fig:partialdependencemars}) shows the hinge functions for the two selected covariates.
\begin{figure}
\centering
\input{Chap07/Chap07-partialdependencemars.tikz}
\vspace{-5cm}
\caption{Dependence of hurricane rate on covariates in a MARS model.}
\label{fig:partialdependencemars}
\end{figure}
The knots on the SOI are located at about $-$2 and $+$4 s.d. There is no relationship between the SOI and hurricane counts for the lowest values of SOI and there is a sharp inverse relationship for the largest values. Caution is advised against over-interpretation as the graph only describes the SOI-hurricane relationship for median NAO values. The single knot on the NAO indicates no relationship between the NAO and hurricane counts for values less than about $-$0.5 s.d. This applies only for median SOI values. There are no standard errors from which to obtain confidence bands.
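Partial dependence plots like these can be drawn with the \verb@plotmo@ function from the {\bf plotmo} package. By default each selected covariate is varied over its range while the other covariates are held at their median values.
\begin{Schunk}
\begin{Sinput}
> require(plotmo)
> plotmo(mars)
\end{Sinput}
\end{Schunk}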
Before making forecasts you use cross validation on your MARS model to get an estimate of the correlation ($\mathrm{r}$) between observed and predicted counts using independent data. You do this by specifying the number of cross validations with the \verb@nfold@ argument\footnote{Cross validation is done if the argument is greater than one.}. The function \verb@earth@ builds \verb@nfold@ cross-validated models. For each fold it builds a model with in-sample data and then uses the model to compute the R-squared value from predictions made on the out-of-sample data. For instance, with \verb@nfold=2@ the number of years in the in-sample and out-of-sample sets is roughly the same. The years are chosen randomly, so you set a seed to allow exact replication of your results.
Here you set \verb@nfold@ at 5 as a compromise between having enough data to both build the model and make predictions with it. You stabilize the variance further by specifying \verb@ncross@ to allow 40 different \verb@nfold@ cross validations.
\begin{Schunk}
\begin{Sinput}
> set.seed(3042)
> marsCV = earth(All ~ SOI + NAO, data=df, nfold=5,
+ ncross=40, glm=list(family="poisson"))
\end{Sinput}
\end{Schunk}
The R-squared results are saved in the \verb@cv.rsq.tab@ component of your \verb@marsCV@ object. The last row gives the mean R-squared value, which provides an estimate of the average skill your model will have when it is used to make actual forecasts. The square root of that value is the correlation, obtained by typing
\begin{Schunk}
\begin{Sinput}
> rn = dim(marsCV$cv.rsq.tab)[1]
> mars.r = sqrt(marsCV$cv.rsq.tab[rn, ][1])
> mars.r
\end{Sinput}
\begin{Soutput}
All
0.291
\end{Soutput}
\end{Schunk}
The mean $\mathrm{r}$ value is 11~\% higher than the $\mathrm{r}$ value from the LOOCV of your Poisson regression model (see Table~\ref{tab:skillmetricsout}). This is an improvement but well below the correlation estimated from your in-sample skill.
\section{Zero-Inflated Count Model}
\label{sec:zeroinflatedcountmodel}
The Poisson regression model is a good place to start when working with count data, but it might not be good enough when the data exhibit over-dispersion or contain a large number of zero values. Consider the question of whether your Poisson regression model of hurricane counts is adequate. You examine model adequacy using the residual deviance. The residual deviance is $-2$ times the difference in log-likelihood between your fitted model and a saturated model having one parameter per observation.
The residual deviance along with the residual degrees of freedom are available from the summary method on your {\tt glm} object.
\begin{Schunk}
\begin{Sinput}
> prm = glm(All ~ SOI + NAO + SSN, data=df,
+ family="poisson")
> s = summary(prm)
> rd = s$deviance
> dof = s$df.residual
\end{Sinput}
\end{Schunk}
Under the null hypothesis that your model is adequate, the residual deviance has a $\chi^2$ distribution with degrees of freedom equal to the residual degrees of freedom. Caution: this is reversed from the typical case, since here the null hypothesis is the outcome you hope for. To obtain the $p$-value for a test of model adequacy type
\begin{Schunk}
\begin{Sinput}
> pchisq(rd, dof, lower.tail=FALSE)
\end{Sinput}
\begin{Soutput}
[1] 0.0255
\end{Soutput}
\end{Schunk}
The residual deviance is 175.61 on 141 degrees of freedom resulting in a $p$-value of 0.0255. This provides you with evidence that something is missing.
The problem may be that hurricanes tend to arrive in clusters even after taking into account the covariates that influence hurricanes rates from year to year. This clustering produces over-dispersion in observed counts. You will examine this possibility and what to do about it in Chapter~\ref{chap:clustermodels}.
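A quick diagnostic for over-dispersion, using the \verb@rd@ and \verb@dof@ objects computed above, is the ratio of the residual deviance to the residual degrees of freedom; values well above one suggest extra-Poisson variation.
\begin{Schunk}
\begin{Sinput}
> rd / dof
\end{Sinput}
\end{Schunk}
Here the ratio is about 1.25.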
Another problem arises when count data have many zeros. This is typical when two processes are at work: one determining whether there is at least one event, and another determining how many events there are. An example is the occurrence of cloud-to-ground lightning strikes inside a tropical cyclone. There will be many more hours with zero strikes because the convective processes leading to no strikes differ from those that produce one or more strikes. These kinds of data can be handled with zero-inflated models.
Zero-inflated count models are mixture models combining a point mass at zero with a proper count distribution. This leaves you with two sources of zeros: one from the point mass and the other from the count distribution. Usually the count component is a Poisson regression and the model for the unobserved zero state is a binomial regression.
The \verb@zeroinfl@ function in the {\bf pscl} package \citep{ZeileisEtAl2008} is used to fit a zero-inflated model using the method of maximum likelihood. The formula argument mainly describes the count data model; that is, \verb@y ~ x1 + x2@ specifies a count data regression where all zero counts have the same probability of belonging to the zero component. This is equivalent to the model \verb@y ~ x1 + x2 | 1@, which makes explicit that the zero-inflation component has only an intercept. Additional predictors can be added so that not all zeros have the same probability of belonging to the point mass component or to the count component. A typical formula is \verb@y ~ x1 + x2 | z1 + z2@. The covariates in the zero and count components can be overlapping (or identical).
For example, to model your U.S. hurricane counts where the count model uses all four covariates and where the binomial model uses only the SST variable, type
\begin{Schunk}
\begin{Sinput}
> require(pscl)
> zim = zeroinfl(All ~ SOI + NAO + SST + SSN | SST,
+ data=df)
\end{Sinput}
\end{Schunk}
The model syntax includes a vertical bar to separate the covariates of the two model components. The returned object is of class {\tt zeroinfl} and is similar to an object of class {\tt glm}. The object contains lists for the coefficients and terms of each model component. To get a summary of the model object, type
\begin{Schunk}
\begin{Sinput}
> summary(zim)
\end{Sinput}
\begin{Soutput}
Call:
zeroinfl(formula = All ~ SOI + NAO + SST + SSN |
SST, data = df)
Pearson residuals:
Min 1Q Median 3Q Max
-1.492 -0.774 -0.127 0.583 3.061
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.64087 0.10593 6.05 1.5e-09
SOI 0.06920 0.02236 3.09 0.0020
NAO -0.16928 0.06554 -2.58 0.0098
SST 0.57178 0.28547 2.00 0.0452
SSN -0.00239 0.00143 -1.67 0.0942
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.44 2.03 -2.19 0.029
SST 6.80 3.89 1.75 0.080
Number of iterations in BFGS optimization: 16
Log-likelihood: -231 on 7 Df
\end{Soutput}
\end{Schunk}
Results show that SOI, NAO, and SST have a statistically significant influence on the number of U.S. hurricanes (the SSN term is marginally significant) and that SST has a significant relationship with whether or not there will be at least one hurricane. But note that the sign on the SST coefficient is positive in the zero-inflation component, indicating that higher SST is associated with {\it more} years {\it without} hurricanes. The sign is also positive in the count component, indicating that, given at least one hurricane, higher SST is associated with a higher hurricane rate.
A cross-validation exercise indicates that the zero-inflated model performs slightly worse than the Poisson model. The useful skill as measured by the mean absolute error is 3.7~\% above climatology for your zero-inflated model compared with 4.9~\% above climatology for your Poisson model.
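The cross validation is arranged in the same way as before. Here is a sketch of the loop, refitting the zero-inflated model with each year held out in turn:
\begin{Schunk}
\begin{Sinput}
> n = length(df$All)
> zrx = numeric(n)
> for(i in 1:n){
+   zi = zeroinfl(All ~ SOI + NAO + SST + SSN | SST,
+     data=df[-i, ])
+   zrx[i] = predict(zi, newdata=df[i, ],
+     type="response")
+ }
\end{Sinput}
\end{Schunk}
The skill metrics are then computed on \verb@zrx@ exactly as for the Poisson hindcasts.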
\section{Machine Learning}
You can remove the Poisson assumption altogether by employing a machine-learning algorithm that searches your data for patterns related to the annual counts. This is called `data mining.' A regression tree is a type of machine-learning algorithm that outputs a series of decisions, with each decision leading either to a value of the response or to another decision. If the value of the NAO is less than $-$1 s.d., for example, then the response is two hurricanes. If it is greater, the next decision might ask whether the SOI is greater than 0.5 s.d., and so on. A single tree captures the relationships between annual counts and your predictors.
To see how this works, import the annual hurricane data and subset on years since 1950. Create a data frame containing only the basin-wide hurricane counts and SOI and SST as the two predictors.
\begin{Schunk}
\begin{Sinput}
> load("annual.RData")
> dat = subset(annual, Year >= 1950)
> df = data.frame(H=dat$B.1, SOI=dat$soi, SST=dat$sst)
\end{Sinput}
\end{Schunk}
Then using the \verb@tree@ function from the {\bf tree} package \citep{Ripley2011}, type
\begin{Schunk}
\begin{Sinput}
> require(tree)
> rt = tree(H ~ SOI + SST, data=df)
\end{Sinput}
\end{Schunk}
The model syntax is the same as before with the response variable to the left of the tilde and the covariates to the right. To plot the regression tree, type
\begin{Schunk}
\begin{Sinput}
> plot(rt); text(rt)
\end{Sinput}
\end{Schunk}
Instead of interpreting the parameter values from a table of coefficients, you interpret a regression tree from an upside-down tree-like diagram. You start at the top. The first branch is a split on your SST variable at a value of 0.33. The split is a rule. Is the value of SST less than 0.33$^\circ$C? If yes, branch to the left; if no, branch to the right. All splits work this way. Following on the right, the next split is on SOI. If SOI is greater than 0.12 s.d., then the mean value over all years meeting these conditions is 10.8 hur/yr. This is the end of the branch (leaf). You check this by typing
\begin{Schunk}
\begin{Sinput}
> mean(df$H[df$SST >= .33 & df$SOI > .12])
\end{Sinput}
\begin{Soutput}
[1] 10.8
\end{Soutput}
\end{Schunk}
The model is fit using binary recursive partitioning. Splits are made along coordinate axes of SST and SOI so that on any branch, a split is chosen that maximally distinguishes the hurricane counts. Splitting continues until the variables cannot be split or there are too few years (fewer than six by default). Here SST is the variable explaining the most variation in the counts, so it gets selected first. Again, the value at which the split occurs is chosen to maximize the difference in counts between the two subsets. The tree has five branches.
In general the key questions are: which variables are best to use and which value gives the best split. The choice of variables is similar to the forward selection procedure of stepwise regression (Chapter~\ref{chap:classicalstatistics}). A prediction is made by determining which leaf is reached based on the values of your predictors. To determine the mean number of hurricanes when SOI is $-$2 s.d. and SST is 0.2$^\circ$C, you use the \verb@predict@ method and type
\begin{Schunk}
\begin{Sinput}
> predict(rt, data.frame(SOI=-2, SST=.2))
\end{Sinput}
\begin{Soutput}
1
7.35
\end{Soutput}
\end{Schunk}
The predicted value depends on the tree. The tree depends on what years are used to grow it. For example, regrow the tree by leaving the last year out and make a prediction using the same two predictor values.
\begin{Schunk}
\begin{Sinput}
> rt2 = tree(H ~ SOI + SST, data=df[-61, ])
> predict(rt2, data.frame(SOI=-2, SST=.2))
\end{Sinput}
\begin{Soutput}
1
5.71
\end{Soutput}
\end{Schunk}
Results are different. Which prediction do you choose? Forecast sensitivity occurs with all statistical models, but is more acute in models that contain a larger number of parameters. Each branch in a regression tree is a parameter so with your two predictors the model has five parameters.
A random forest algorithm sidesteps the question of prediction choice by making predictions from many trees \citep{Breiman2001}. It draws a bootstrap sample from the set of all years and grows a tree using data only from the sampled years. It then repeats the sampling and grows another tree. Each tree gives a prediction, and the mean is taken over all trees. The function \verb@randomForest@ in the {\bf randomForest} package provides a random forest algorithm. For example, type
\begin{Schunk}
\begin{Sinput}
> require(randomForest)
> rf = randomForest(H ~ SOI + SST, data=df)
\end{Sinput}
\end{Schunk}
By default the algorithm grows 500 trees. To make a prediction type,
\begin{Schunk}
\begin{Sinput}
> predict(rf, data.frame(SOI=-2, SST=.2))
\end{Sinput}
\begin{Soutput}
1
4.91
\end{Soutput}
\end{Schunk}
Regression trees and random forest algorithms tend to overfit your data, especially when you are searching over a large set of {\it potential} predictors in noisy climate data. Overfitting results in small in-sample error but large out-of-sample error. Again, a cross-validation exercise is needed if you want to claim the algorithm has superior predictive skill. Cross validation removes the noise specific to each year's set of observations and estimates how well the random forest algorithm finds prediction rules when this coincident information is unavailable. For example, {\it does the random forest algorithm provide better prediction skill than a Poisson regression?} To answer this question you arrange a LOOCV as follows.
\begin{Schunk}
\begin{Sinput}
> n = length(df$H)
> rfx = numeric(n)
> prx = numeric(n)
> for(i in 1:n){
+ rfm = randomForest(H ~ SOI + SST, data=df[-i, ])
+ prm = glm(H ~ SOI + SST, data=df[-i, ],
+ family="poisson")
+ new = df[i, ]
+ rfx[i] = predict(rfm, newdata=new)
+ prx[i] = predict(prm, newdata=new,
+ type="response")
+ }
\end{Sinput}
\end{Schunk}
The out-of-sample mean-squared prediction error is computed by typing
\begin{Schunk}
\begin{Sinput}
> mean((df$H - prx)^2); mean((df$H - rfx)^2)
\end{Sinput}
\begin{Soutput}
[1] 5.07
\end{Soutput}
\begin{Soutput}
[1] 5.36
\end{Soutput}
\end{Schunk}
Results indicate that the Poisson regression performs slightly better than the random forest algorithm in this case although the difference is not statistically significant. The correlation between the actual and predicted value is 0.539 for the Poisson model and 0.502 for the random forest algorithm.
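The quoted correlations come directly from the cross-validated prediction vectors:
\begin{Schunk}
\begin{Sinput}
> cor(df$H, prx); cor(df$H, rfx)
\end{Sinput}
\end{Schunk}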
With only a few variables you can examine the bivariate influence of the covariates on the response. Figure~\ref{fig:bivariateinfluence} shows the bivariate influence of SST and SOI on hurricane counts using the random forest algorithm and Poisson regression. Hurricane counts increase with SST and SOI, but for high values of SOI the influence of SST is stronger. Similarly, for high values of SST the influence of the SOI is more pronounced. The random forest is able to capture nonlinearities and thresholds, but at the expense of interpreting some noise as signal, as seen by the relatively high count when SOI values are near $-$3 s.d. and SST values are near $-$0.1$^\circ$C.
\begin{figure}
\centering
\input{Chap07/Chap07-bivariateinfluence.tikz}
\vspace{-.2cm}
\caption{Hurricane response. (a) Random forest and (b) Poisson regression.}
\label{fig:bivariateinfluence}
\end{figure}
\section{Logistic Regression}
Some of our research in the 1990s focused on the climatology of tropical cyclones of non-tropical origin \citep{ElsnerLehmillerKimberlain1996, KimberlainElsner1998}. We analyzed available information from each North Atlantic hurricane since 1944 to determine whether we could discern middle latitude influences on development. We classified hurricanes into tropical and baroclinic based on primary origin and development mechanisms. You would like to have a model that predicts a hurricane's group membership based on where the hurricane originated.
Logistic regression is the model of choice when your response variable is dichotomous. Many phenomena can be studied in this way. An event either occurs or it doesn't. The focus is to predict the occurrence of the event. A hurricane forecaster is keen about whether an area of disturbance will develop into a cyclone given the present atmospheric conditions.
Logistic regression is a generalization of the linear regression model (like Poisson regression) where the response variable does not have a normal distribution and the regression structure is linear in the covariates. Like Poisson regression the model coefficients are determined using the method of maximum likelihood.
The mean of a binary variable is a proportion. For example, generate ten random binary values and compute the mean by typing
\begin{Schunk}
\begin{Sinput}
> set.seed(123)
> x = rbinom(n=10, size=1, prob=.5)
> x
\end{Sinput}
\begin{Soutput}
[1] 0 1 0 1 1 0 1 1 1 0
\end{Soutput}
\begin{Sinput}
> mean(x)
\end{Sinput}
\begin{Soutput}
[1] 0.6
\end{Soutput}
\end{Schunk}
Think of the 1's as heads and the 0's as tails from ten flips of a fair coin (\verb@prob=.5@). You find six heads in ten flips. The mean is the proportion of heads in the sample. This proportion can be interpreted as a probability, so it is denoted $\pi$. The logistic regression model specifies how $\pi$ is related to a set of explanatory variables.
\subsection{Exploratory analysis}
You input the hurricane data by typing
\begin{Schunk}
\begin{Sinput}
> bh = read.csv("bi.csv", header=TRUE)
> table(bh$Type)
\end{Sinput}
\begin{Soutput}
0 1 3
187 77 73
\end{Soutput}
\end{Schunk}
The type as determined in \cite{ElsnerLehmillerKimberlain1996} is indicated by the variable {\tt Type} with 0 indicating tropical-only, 1 indicating baroclinic influences, and 3 indicating baroclinic initiation. The typing was done subjectively using all the available synoptic information about each hurricane. While the majority of hurricanes form over warm ocean waters of the deep tropics (`tropical-only') some are aided in their formation by interactions with midlatitude jet stream winds (`baroclinically induced'). The stronger, tropical-only hurricanes develop farther south and primarily occur in August and September. The weaker, baroclinically-induced hurricanes occur throughout a longer season.
First combine the baroclinic types into a single group and add this column to the data frame.
\begin{Schunk}
\begin{Sinput}
> bh$tb = as.integer(bh$Type != 0)
> table(bh$tb)
\end{Sinput}
\begin{Soutput}
0 1
187 150
\end{Soutput}
\end{Schunk}
With this grouping there are 187 tropical and 150 baroclinic hurricanes in the record of 337 cyclones. Thus you can state that a hurricane drawn at random from this set of cyclones has about a 55~\% chance of being tropical only.
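These climatological probabilities are just the sample proportions, obtained by typing
\begin{Schunk}
\begin{Sinput}
> prop.table(table(bh$tb))
\end{Sinput}
\end{Schunk}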
Your interest is to improve on this climatological probability model by adding a covariate. Here you consider the latitude at which the cyclone first reaches hurricane strength. As an exploratory step you create box plots of the latitudes grouped by hurricane type.
\begin{Schunk}
\begin{Sinput}
> boxplot(bh$FirstLat ~ bh$tb, horizontal=TRUE,
+ notch=TRUE, yaxt="n", boxwex=.4,
+ xlab="Latitude of Hurricane Formation")
> axis(2, at=c(1, 2), labels=c("Tropical",
+ "Baroclinic"))
\end{Sinput}
\end{Schunk}
Here you make the box plots horizontal (Fig.~\ref{fig:sidebysideboxplots}) because latitude is your explanatory variable, which should be plotted on the horizontal axis. With the argument \verb@notch@ switched on, notches are drawn on the box sides. The vertical dash inside the box is the median latitude. Notches extend to $\pm 1.58 \times$ IQR/$\sqrt{n}$, where IQR is the interquartile range (see Chapter~\ref{chap:Rtutorial}) and $n$ is the sample size. If the notches of two box plots do not overlap this provides strong evidence that the two medians are statistically different \citep{ChambersEtAl1983}.
\begin{figure}
\centering
\input{Chap07/Chap07-boxplotsfig.tikz}
\vspace{-.7cm}
\caption{Genesis latitude by hurricane type.}
\label{fig:sidebysideboxplots}
\end{figure}
The median formation latitude for the set of tropical hurricanes is 17.9$^\circ$N and for the set of baroclinic hurricanes is farther north at 29.1$^\circ$N. This makes physical sense as cyclones farther south are less likely to have influence from middle latitude baroclinic disturbances. The relatively small overlap between the two sets of latitudes strengthens your conviction that a latitude variable will improve a model for hurricane type.
\subsection{Logit and logistic functions}
Linear regression is not an appropriate model for binary data. It violates the assumptions of equal variance and normality of residuals, resulting in invalid standard errors and erroneous hypothesis tests. In its place you use a generalized linear model, as you did above with the count data.
However, instead of using the logarithm as the link between the response and the covariates as you did in the Poisson regression model, here you use the {\it logit} function. The logit of a number $\pi$ between 0 and 1 is
\begin{equation}
\hbox{logit}(\pi)=\log\left( \frac{\pi}{1-\pi} \right) =\log(\pi)-\log(1-\pi).
\label{eq:logitfunction}
\end{equation}
If $\pi$ is a probability then $\pi/(1-\pi)$ is the corresponding {\it odds}, and the logit of the probability is the logarithm of the odds. Odds are expressed as for:against (read: for to against) something happening. So a hurricane strike with posted odds of 1:4 has a 20~\% chance of occurring.
The logistic regression model is expressed statistically as
\begin{equation}
\hbox{logit}(\pi) = \beta_0+\beta_1 x_1 + \ldots +\beta_p x_p,
\label{eq:logisticregression}
\end{equation}
where $\pi$ is the probability of occurrence. There are $p$ covariates ($x_i$'s) and $p+1$ parameters ($\beta_i$'s). Unlike linear regression there is no additive residual term; the randomness comes from the binomial variation of the response about its mean $\pi$.
To convert logit($\pi$) to $\pi$ (probability of occurrence) you use the {\it logistic} function (inverse of the logit function) given as
\begin{equation}
\hbox{logistic}(\alpha) = \frac{1}{1 + \exp(-\alpha)} = \frac{\exp(\alpha)}{1 + \exp(\alpha)}.
\label{eq:logisticfunction}
\end{equation}
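You can check that the two functions are inverses with a couple of lines of R. Here \verb@logit@ and \verb@logistic@ are defined directly (base R has equivalents in \verb@qlogis@ and \verb@plogis@):
\begin{Schunk}
\begin{Sinput}
> logit = function(p) log(p / (1 - p))
> logistic = function(a) 1 / (1 + exp(-a))
> logistic(logit(.2))
\end{Sinput}
\end{Schunk}
The last line returns 0.2, and logit(.2) is $\log(1/4)$, the logarithm of the 1:4 odds.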
\subsection{Fit and interpretation}
To fit a logistic regression model to hurricane type with latitude as the covariate and saving the model as an object, type
\begin{Schunk}
\begin{Sinput}
> lorm = glm(tb ~ FirstLat, data=bh,
+ family="binomial")
\end{Sinput}
\end{Schunk}
The function call is very similar to the one you used to fit the Poisson regression model, but here the family is binomial instead of poisson. The formula is read `hurricane type is modeled as a function of formation latitude.'
The model coefficients are determined by the method of maximum likelihood in the \verb@glm@ function. To produce a table of the coefficients you type
\begin{Schunk}
\begin{Sinput}
> summary(lorm)$coefficients
\end{Sinput}
\begin{Soutput}
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.083 0.9615 -9.45 3.50e-21
FirstLat 0.373 0.0395 9.45 3.49e-21
\end{Soutput}
\end{Schunk}
The estimated coefficients are listed by row. The intercept is the log odds that a hurricane forming at a latitude of zero is baroclinic. In other words, the odds of being baroclinic when the latitude is zero is exp($-9.083$) = 0.000114. These odds are very low, but that makes sense since no hurricanes form at the equator. So the intercept in the model corresponds to the log odds of a baroclinic hurricane at the {\it hypothetical} latitude of the equator.
Interest is on the coefficient of the formation latitude variable indicated by the row labeled {\tt FirstLat}. The value is 0.373. Before fitting the model you anticipate the formation latitude coefficient to have a positive sign. Why? Because baroclinic (tropical) type hurricanes are coded as 1 (0) in your data set and the box plots show that as formation latitude increases, the chance that a hurricane has baroclinic influences increases. Note that if your response values are character strings (e.g., `to' and `be') rather than coded as 0s and 1s, things will still work, but R will assign 0s and 1s based on alphabetical order and this will affect how you make sense of the coefficient's sign.
The magnitude of the coefficient is interpreted to mean that for every degree increase in formation latitude the log odds increases by a constant 0.373 units, on average. This is not very informative. By taking the exponent of the coefficient value, the interpretation is in terms of an odds ratio.
\begin{Schunk}
\begin{Sinput}
> exp(summary(lorm)$coefficients[2])
\end{Sinput}
\begin{Soutput}
[1] 1.45
\end{Soutput}
\end{Schunk}
Thus for every degree increase in formation latitude the odds of a baroclinic hurricane increase on average by a constant factor of 1.45 (or 45~\%). This 45~\% increase does not depend on the value of latitude; that is, logistic regression is linear in the odds ratio. The interpretation is valid only over the range of latitudes in your data and is physically meaningless for latitudes outside the range where hurricanes occur.
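You can confirm that the odds ratio does not depend on latitude by comparing predictions one degree apart. A sketch using latitudes of 20 and 21$^\circ$N:
\begin{Schunk}
\begin{Sinput}
> p = predict(lorm, newdata=data.frame(
+   FirstLat=c(20, 21)), type="response")
> (p[2] / (1 - p[2])) / (p[1] / (1 - p[1]))
\end{Sinput}
\end{Schunk}
The ratio equals exp(0.373) = 1.45 whichever pair of adjacent latitudes you choose.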
The table of coefficients includes a standard error and $p$-value. Statistical significance is based on a null hypothesis that the coefficient is zero. The ratio of the estimated coefficient to its standard error ($z$-value) has an approximate standard normal distribution assuming the null hypothesis is true. The probability of finding a $z$-value this extreme or more extreme is the $p$-value. The smaller the $p$-value, the less support there is for the null hypothesis given the data and the model. Here the lack of support for the null indicates that formation latitude is a statistically significant predictor of hurricane type.
Also, a confidence interval on the estimated coefficient is obtained by typing
\begin{Schunk}
\begin{Sinput}
> confint(lorm)[2, ]
\end{Sinput}
\begin{Soutput}
2.5 % 97.5 %
0.301 0.456
\end{Soutput}
\end{Schunk}
This is interpreted to mean that although your best estimate of the increase in log odds per degree of latitude is 0.373, there is a 95~\% chance that an interval constructed in this way, here 0.301 to 0.456, will cover the true value.
\subsection{Prediction}
Predictions help you understand your model. As you did previously you use the \verb@predict@ method. To predict the probability that a hurricane picked at random from your data will be baroclinic given that its formation latitude is 20$^\circ$N latitude, you type
\begin{Schunk}
\begin{Sinput}
> predict(lorm, newdata=data.frame(FirstLat=20),
+ type="response")
\end{Sinput}
\begin{Soutput}
1
0.164
\end{Soutput}
\end{Schunk}
Thus the probability of a baroclinic hurricane forming at this low latitude is 16.4~\% on average.
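Because \verb@type="response"@ simply applies the inverse logit to the linear predictor, the same number can be computed by hand from the fitted coefficients.
\begin{Schunk}
\begin{Sinput}
> b = coef(lorm)
> plogis(b[1] + b[2] * 20)
\end{Sinput}
\end{Schunk}
This should reproduce the 0.164 obtained above.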
To create a plot of predictions across a range of latitudes, first prepare a vector of latitudes. The vector spans the latitudes in your data set. You specify an increment of 0.1$^\circ$ so the resulting prediction curve is smooth. You then use the predict method with the \verb@se.fit@ switch turned on and save the average prediction and the predictions corresponding to $\pm$1.96$\times$ the standard error.
\begin{Schunk}
\begin{Sinput}
> lats = seq(min(bh$FirstLat), max(bh$FirstLat), .1)
> probs = predict(lorm,
+ newdata=data.frame(FirstLat=lats),
+ type="response", se.fit=TRUE)
> pm = probs$fit
> pu = probs$fit + probs$se.fit * 1.96
> pl = probs$fit - probs$se.fit * 1.96
\end{Sinput}
\end{Schunk}
Finally you plot the data points at 0 and 1 as you did above with the bar plot and add the predicted values using the \verb@lines@ function.
\begin{Schunk}
\begin{Sinput}
> plot(bh$FirstLat, bh$tb, pch=19, cex=.4,
+ ylab="Probability",
+ xlab="Formation Latitude (N)")
> grid()
> lines(lats, pm, lwd=2)
> lines(lats, pu, lwd=2, col="red")
> lines(lats, pl, lwd=2, col="red")
\end{Sinput}
\end{Schunk}
\begin{figure}
\centering
\input{Chap07/Chap07-logisticpredictionplot.tikz}
\vspace{-.7cm}
\caption{Logistic regression model for hurricane type.}
\label{fig:logisticpredictions}
\end{figure}
Results are shown in Fig.~\ref{fig:logisticpredictions}. Tropical-only and baroclinically-enhanced hurricane points are shown along the $y=0$ and $y=1$ lines, respectively. The gray band is the 95~\% pointwise confidence interval. Model predictions make sense. The probability of a baroclinically-enhanced hurricane is less than 20~\% at latitudes south of 20$^\circ$N. However, by latitude 24$^\circ$N, the probability exceeds 50~\% and by latitude 31$^\circ$N the probability exceeds 90~\%. Note that although the odds ratio is constant, the probability is a nonlinear function of latitude.
\subsection{Fit and adequacy}
Output from a summary method applied to your model object (\verb@summary(lorm)@) prints statistics of model fit including the null and residual deviances and the AIC (see Chapter~\ref{chap:classicalstatistics}). These are shown below the table of coefficients. One measure of model fit is the significance of the overall model.
This test asks whether the model with latitude fits significantly better than a model with just an intercept. An intercept-only model is called a `null' model (no covariates). The test statistic is the difference between the residual deviance for the model with latitude and the null model. The test statistic has a $\chi$-squared distribution with degrees of freedom equal to the differences in degrees of freedom between the latitude model and the null model (i.e. the number of predictors in the model; here just one).
To find the difference in deviance between the two models (i.e. the test statistic) along with the difference in degrees of freedom, type
\begin{Schunk}
\begin{Sinput}
> dd = lorm$null.deviance - lorm$deviance
> ddof = lorm$df.null - lorm$df.residual
> dd; ddof
\end{Sinput}
\begin{Soutput}
[1] 231
\end{Soutput}
\begin{Soutput}
[1] 1
\end{Soutput}
\end{Schunk}
Then the $p$-value as evidence in support of the null model is obtained by typing
\begin{Schunk}
\begin{Sinput}
> 1 - pchisq(q=dd, df=ddof)
\end{Sinput}
\begin{Soutput}
[1] 0
\end{Soutput}
\end{Schunk}
This leads you to reject the null hypothesis in favor of the model that includes latitude as a covariate.
A model can fit well but still be inadequate if it is missing an important predictor or if the relationship has a different form. Model adequacy is examined with the residual deviance statistic. The test is performed under the null hypothesis that the model is adequate (see \S\ref{sec:zeroinflatedcountmodel}). Under this hypothesis, the residual deviance has a $\chi$-squared distribution with residual degrees of freedom. Thus to test the model for adequacy you type
\begin{Schunk}
\begin{Sinput}
> pchisq(q=lorm$deviance, df=lorm$df.residual)
\end{Sinput}
\begin{Soutput}
[1] 4.24e-06
\end{Soutput}
\end{Schunk}
The small $p$-value indicates the model is not adequate. So while formation latitude is a statistically significant predictor of baroclinic hurricanes, the model can be improved.
To try to improve the model you add another variable. Here you create a new model by adding the latitude at which maximum intensity first occurs (\verb@MaxLat@) and examine the table of coefficients.
\begin{Schunk}
\begin{Sinput}
> lorm2 = glm(tb ~ FirstLat + MaxLat, data=bh,
+ family="binomial")
> summary(lorm2)$coefficients
\end{Sinput}
\begin{Soutput}
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.560 0.9770 -8.76 1.93e-18
FirstLat 0.504 0.0662 7.62 2.50e-14
MaxLat -0.134 0.0482 -2.77 5.57e-03
\end{Soutput}
\end{Schunk}
Although the latitude at maximum intensity is also statistically significant, something is wrong. The sign on the coefficient is negative indicating that baroclinic hurricanes are more likely if maximum latitude occurs farther south. This lacks physical sense and it indicates a serious problem with the model.
The problem arises because of the strong correlation between your two explanatory variables (see Chapter~\ref{chap:classicalstatistics}). You check the correlation between the two variables by typing
\begin{Schunk}
\begin{Sinput}
> cor(bh$FirstLat, bh$MaxLat)
\end{Sinput}
\begin{Soutput}
[1] 0.855
\end{Soutput}
\end{Schunk}
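A complementary measure is the variance inflation factor (VIF). A sketch using the identity $\mathrm{VIF} = 1/(1-R^2)$, where $R^2$ comes from regressing one explanatory variable on the other:
\begin{Schunk}
\begin{Sinput}
> r2 = summary(lm(MaxLat ~ FirstLat, data=bh))$r.squared
> 1/(1 - r2)
\end{Sinput}
\end{Schunk}
With a correlation of 0.855 this gives $1/(1 - 0.855^2) \approx 3.7$, consistent with the two variables sharing most of their information.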
The correlation exceeds 0.6, so it is best to remove one of the variables. You go back to a one-predictor model, but this time you use maximum latitude. You again check the model for statistical significance and adequacy, and you find both.
\begin{Schunk}
\begin{Sinput}
> lorm3 = glm(tb ~ MaxLat, data=bh, family="binomial")
> summary(lorm3)$coefficients
\end{Sinput}
\begin{Soutput}
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.965 0.6760 -8.82 1.10e-18
MaxLat 0.207 0.0236 8.78 1.58e-18
\end{Soutput}
\begin{Sinput}
> pchisq(q=lorm3$deviance, df=lorm3$df.residual)
\end{Sinput}
\begin{Soutput}
[1] 0.543
\end{Soutput}
\end{Schunk}
Thus you settle on a final model that includes the latitude at maximum intensity as the sole predictor.
\subsection{Receiver operating characteristics}
Your model predicts the probability that a hurricane has baroclinic influences given the latitude at lifetime maximum intensity. To make a decision from this forecast, you need to choose a threshold probability. For example if the probability exceeds 0.5 then you predict a baroclinic hurricane.
Given your set of hindcast probabilities, one for each hurricane, and a discrimination threshold you can create a two-by-two table of observed versus hindcast frequencies indicating how many times you correctly and incorrectly forecast baroclinic and tropical hurricanes. Here you do this using the \verb@table@ function on the logical vector of your predictions together with the vector of observed hurricane types.
\begin{Schunk}
\begin{Sinput}
> tab = table(predict(lorm3, type="response") > .5,
+ bh$tb)
> dimnames(tab) = list(Predicted=c("True",
+ "False"), Observed=c("BE", "TO"))
> tab
\end{Sinput}
\begin{Soutput}
Observed
Predicted BE TO
True 147 51
False 40 99
\end{Soutput}
\end{Schunk}
Note that you add dimension names to the table object to get the row and column names. The results show that of the 150 tropical only hurricanes, 99 are predicted as such by the model using the threshold of 0.5 probability. Similarly, of the 187 baroclinic hurricanes, 147 are predicted as such by the model.
In this binary setup 147 is the number of true positives, 40 is the number of false negatives, 51 is the number of false positives, and 99 is the number of true negatives. The {\it sensitivity} of your binary classification scheme is the true positive proportion, 147/(147+40) = 79~\%. The {\it specificity} is the true negative proportion, 99/(99+51) = 66~\%.
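These proportions can be computed directly from the table object (a sketch, assuming the row and column labels assigned above):
\begin{Schunk}
\begin{Sinput}
> tab["True", "BE"]/sum(tab[, "BE"])
> tab["False", "TO"]/sum(tab[, "TO"])
\end{Sinput}
\end{Schunk}
The first line gives the sensitivity (147/187) and the second the specificity (99/150).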
Note that the values for sensitivity and specificity depend on your choice of discrimination threshold. Raising the threshold predicts fewer baroclinic hurricanes, which lowers the sensitivity and raises the specificity; lowering the threshold does the reverse.
The false positive proportion is one minus the specificity. As you allow more false positives you increase the sensitivity of your model. In the limit, if your model predicts every hurricane to be BE it is perfectly sensitive (no baroclinic hurricane is missed), but its specificity is zero, making it useless as a classifier.
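You can trace this tradeoff yourself by sweeping the threshold across its range and recomputing the two proportions at each value (a sketch using the in-sample predictions):
\begin{Schunk}
\begin{Sinput}
> p = predict(lorm3, type="response")
> th = seq(0, 1, by=.01)
> sens = sapply(th, function(t) mean(p[bh$tb == 1] > t))
> fpp = sapply(th, function(t) mean(p[bh$tb == 0] > t))
> plot(fpp, sens, type="l",
+   xlab="False positive proportion", ylab="Sensitivity")
\end{Sinput}
\end{Schunk}
Plotting sensitivity against the false positive proportion in this way is exactly the receiver operating characteristic curve introduced next.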
A receiver operating characteristic (ROC) curve is a graph of the sensitivity versus the false positive rate (1$-$specificity) as the discrimination threshold is varied. Here you use code written by Peter DeWitt to generate the ROC curves for your logistic model. Source the code by typing
\begin{Schunk}
\begin{Sinput}
> source("roc.R")
\end{Sinput}
\end{Schunk}
The \verb@roc@ function uses the model formula together with a set of training (testing) data and a set of validation data to generate ROC output. Here you use the \verb@sample@ function to randomly select 168 hurricanes (half of the 337, rounded down) for training; the remaining hurricanes are used for validation.
\begin{Schunk}
\begin{Sinput}
> set.seed(23456)
> idx = sample(dim(bh)[1], trunc(.5*dim(bh)[1]))
> out = roc(formula(lorm), data = bh[idx, ],
+ testing.data=bh[-idx, ], show.auc=FALSE)
\end{Sinput}
\end{Schunk}
The output is a list of three elements with the first two containing the area under the ROC curves for the test and validation data, respectively. To plot the curves type
\begin{Schunk}
\begin{Sinput}
> out$plot
\end{Sinput}
\end{Schunk}
\begin{figure}
\centering
\input{Chap07/Chap07-roccurves.tikz}
\vspace{0cm}
\caption{ROC curves for the logistic regression model of hurricane type.}
\label{fig:roccurves}
\end{figure}
Figure~\ref{fig:roccurves} is made using \verb@ggplot@ (see Chapter~\ref{chap:graphsandmaps}) and shows the two ROC curves corresponding to the training and validation data sets. The area under each curve indicates the amount of model skill, with the diagonal (dotted) line indicating no skill. To list the areas type
\begin{Schunk}
\begin{Sinput}
> round(out$testing.auc, 3)
\end{Sinput}
\begin{Soutput}
[1] 0.944
\end{Soutput}
\begin{Sinput}
> round(out$validation.auc, 3)
\end{Sinput}
\begin{Soutput}
[1] 0.905
\end{Soutput}
\end{Schunk}
Both values are close to 1 indicating a skillful model. In general you expect the validation area to be less than the testing area.
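The area under the curve has a useful interpretation: it is the probability that a randomly chosen BE hurricane receives a higher predicted probability than a randomly chosen TO hurricane. A sketch of this computation on the in-sample predictions (ignoring ties):
\begin{Schunk}
\begin{Sinput}
> p = predict(lorm3, type="response")
> mean(outer(p[bh$tb == 1], p[bh$tb == 0], ">"))
\end{Sinput}
\end{Schunk}
The value will differ somewhat from the areas above because those are computed on the separate training and validation sets.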
You interpret the ROC curve as follows. Since interest is in future hurricanes, use the validation curve. If you allow a false positive proportion of 20~\% (0.2 on the horizontal axis), then you can expect to correctly identify 84~\% of future BE hurricanes. If you want to perform better than that, say correctly identifying 95~\% of future BE hurricanes, then you need to accept a false positive proportion of about 40~\%.
This chapter showed how to build models for the occurrence of hurricanes. We began by modeling the annual counts of U.S. hurricanes with a Poisson regression using environmental variables as covariates. We showed how to make predictions with the model and interpret the coefficients. We showed how to assess forecast skill including how to run a cross-validation exercise. We then showed how to include nonlinear terms in the regression using multivariate adaptive regression splines. We also took a look at zero-inflated models as an alternative to Poisson regression. We finished with an examination of logistic regression for predicting hurricane type. We showed how to interpret the coefficients, make predictions, and evaluate the receiver operating characteristics of the model when a decision threshold is imposed.