Code in Support of RPI10-02-001

date()

## [1] "Mon Dec  3 13:04:02 2012"

setwd("~/Dropbox/Book/Chapter11")

## Error: cannot change working directory

Read the hurricane data

Begin by loading annual.RData. These data were assembled in Chapter 5 of Elsner and Jagger (2013). Subset the data for years starting with 1866.

load("annual.RData")
dat = subset(annual, Year >= 1866)

The covariate Southern Oscillation Index (SOI) data begin in 1866. Next, extract all hurricane counts for the Gulf coast, Florida, and East coast regions.

cts = dat[, c("G.1", "F.1", "E.1")]

Cluster detection

Start by comparing the observed with the expected number of years for the two groups of hurricane counts. The groups include years with no hurricanes and years with three or more. The expected number is from a Poisson distribution with a constant rate.

The idea is that for regions that show a cluster of hurricanes, the observed number of years with no hurricanes and the observed number of years with three or more hurricanes should both be greater than the corresponding expected numbers. Said another way, in regions with hurricane clustering, a Poisson model with a hurricane rate estimated from counts over all years will underestimate the number of years with no hurricanes and the number of years with many hurricanes.

For example, you find the observed number of years without a Florida hurricane and the number of years with more than two hurricanes by typing

obs = table(cut(cts$F.1, breaks = c(-0.5, 0.5, 2.5, Inf)))
obs

## 
## (-0.5,0.5]  (0.5,2.5]  (2.5,Inf] 
##         70         62         13

And the expected numbers for these three groups by typing

n = length(cts$F.1)
mu = mean(cts$F.1)
exp = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda = mu))
exp

## [1] 61.655 75.271  8.074

You use functions defined in the correlationfuns.R script to get the observed and expected counts for the three regions as a table.

source("correlationfuns.R")
mu = colMeans(cts)
rt = regionTable(cts, mu)

In the Gulf and East coast regions the observed number of years is relatively close to the expected number of years in each of the groups. In contrast, in the Florida region the observed number of years exceeds the expected number of years for the no-hurricane and the three-or-more-hurricanes groups.

The difference between the observed and expected numbers in each region is used to assess the statistical significance of the clustering. This is done using Pearson residuals and a chi-squared statistic. The Pearson residual is the difference between the observed count and the expected count divided by the square root of the variance. The p-value is evidence in support of the null hypothesis of no clustering as indicated by no difference between the observed and expected numbers in each group.

For example, to obtain the chi-squared statistic, type

xis = sum((obs - exp)^2/exp)
xis

## [1] 6.475

The p-value as evidence in support of the null hypothesis is given by

pchisq(q = xis, df = 2, lower.tail = FALSE)

## [1] 0.03926

where df is the degrees of freedom equal to the number of groups minus one.

The p-values for the Gulf and East coasts are greater than 0.05, indicating little support for the cluster hypothesis. In contrast, the p-value for the Florida region is 0.009 using the Pearson residuals and 0.039 using the chi-squared statistic. These values provide evidence that hurricane occurrences in Florida are grouped in time.
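The Pearson residual test described above can be sketched in a few lines. This is a minimal illustration on simulated counts, not the hurricane data, and it is an assumption that regionTable() computes its Pearson statistic in this way. The variable names are hypothetical.

```r
# Sketch: Pearson test for extra-Poisson variation under a constant rate.
# The counts are simulated (assumption); they are not the hurricane data.
set.seed(1)
h = rpois(145, lambda = 0.7)      # hypothetical annual counts
mu = mean(h)                      # estimate of the constant Poisson rate
res = (h - mu)/sqrt(mu)           # Pearson residuals (Poisson variance = mean)
X2 = sum(res^2)                   # Pearson statistic
dfree = length(h) - 1             # years minus one estimated parameter
pval = pchisq(X2, df = dfree, lower.tail = FALSE)
c(statistic = X2, df = dfree, p.value = pval)
```

A small p-value indicates more year-to-year variation than a Poisson model with a constant rate allows.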

Conditional counts

What might be causing this grouping? The extra variation in annual hurricane counts might be due to variation in hurricane rates. You examine this possibility with a Poisson regression model. The model includes an index for the North Atlantic Oscillation (NAO) and the Southern Oscillation (SOI) after Elsner and Jagger (2006). This is a generalized linear model (GLM) approach using the Poisson family that includes a logarithmic link function for the rate.

The code to fit the model, determine the expected count, and table the observed versus expected counts for each region is given by

pfts = list(
    G.1 = glm(G.1 ~ nao + soi, family = "poisson", data = dat),
    F.1 = glm(F.1 ~ nao + soi, family = "poisson", data = dat),
    E.1 = glm(E.1 ~ nao + soi, family = "poisson", data = dat))
prsp = sapply(pfts, fitted, type = "response")
rt = regionTable(cts, prsp, df = 3)
rt

##     Expected <= 0 Observed <= 0 1 <= Expected <= 2 1 <= Observed <= 2
## G.1         67.68            63              69.96                 75
## F.1         63.66            70              72.31                 62
## E.1         72.43            74              67.60                 64
##     Expected >= 3 Observed >= 3 Pearson.test Pearson.pvalue
## G.1         7.359             7        121.1        0.89736
## F.1         9.031            13        172.4        0.04214
## E.1         4.970             7        148.5        0.33771
##     statistic.X-squared p.value
## G.1              0.7043  0.6942
## F.1              3.8479  0.1549
## E.1              1.0554  0.5847

The count model gives an expected number of hurricanes each year. The expected number is compared to the observed number as before. Results indicate that clustering is somewhat ameliorated by conditioning the rates on the covariates.

In particular, for Florida the Pearson statistic reduces to 172.4 with an increase in the corresponding p-value to 0.042, and the p-value from the chi-squared statistic increases to near 0.15. The Pearson p-value remains below 0.05, indicating the conditional model, while an improvement, fails to capture all the extra variation in Florida hurricane counts.

A cluster model

Having found evidence that Florida hurricanes arrive in clusters, you model this process. In the simplest case you assume the following. Each cluster has either one or two hurricanes and the annual cluster counts follow a Poisson distribution with a rate \( r \). Note the difference. Earlier you assumed each hurricane was independent and annual hurricane counts followed a Poisson distribution. Further, you let \( p \) be the probability that a cluster will have two hurricanes.

Formally your model can be expressed as follows. Let \( N \) be the number of clusters in a given year and \( X_i, i =1, \dots, N \) be the number of hurricanes in each cluster minus one. Then the number of hurricanes in a given year is given by \( H=N+\sum_{i=1}^N X_i \). Conditional on \( N \), \( M=\sum_{i=1}^N X_i \) has a binomial distribution since the \( X_i \)'s are independent Bernoulli variables and \( p \) is constant. That is, \( H=N+M \), where the annual number of clusters \( N \) has a Poisson distribution with cluster rate \( r \), and \( M \) has a binomial distribution with proportion \( p \) and size \( N \). Here the binomial distribution describes the number of two-hurricane clusters among \( N \) independent clusters, with each cluster having a probability \( p \) of containing a second hurricane.

The model has two parameters \( r \) and \( p \). A better parameterization is to use \( \lambda = r(1+p) \) with \( p \) to separate the hurricane frequency from the cluster probability. The parameters do not need to be fixed and can be functions of the covariates. When \( p=0 \), \( H \) is Poisson, and when \( p=1 \), \( H/2 \) is Poisson, the dispersion is two, and the probability that \( H \) is even is 1. You need a way to estimate \( r \) and \( p \).
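A quick simulation helps check the model definition. The sketch below draws cluster counts and second-hurricane counts directly from the definitions above and compares the sample mean and variance of \( H \) with \( r(1+p) \) and \( r(1+3p) \). The values of r and p are hypothetical, chosen only for illustration.

```r
# Sketch: simulate annual counts from the cluster model H = N + M.
# r and p are hypothetical values, not estimates from the hurricane data.
set.seed(42)
r = 0.6                                # annual cluster rate
p = 0.14                               # probability a cluster has two hurricanes
nsim = 200000                          # number of simulated years
N = rpois(nsim, lambda = r)            # clusters per year
M = rbinom(nsim, size = N, prob = p)   # two-hurricane clusters per year (0 when N is 0)
H = N + M                              # hurricanes per year
c(mean = mean(H), theory = r * (1 + p))
c(var = var(H), theory = r * (1 + 3 * p))
```

Setting p = 0 in the sketch returns ordinary Poisson counts, for which the mean and variance agree.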

Parameter estimation

Your goal is a hurricane count distribution for Florida. For that you need an estimate of the annual cluster rate (\( r \)) and the probability (\( p \)) that the cluster size is two. Continuing with the GLM approach you separately estimate the annual hurricane frequency, \( \lambda \), and the annual cluster rate \( r \). The ratio of these two parameters minus one is an estimate of the probability \( p \).

This is reasonable if \( p \) does not vary much, since the annual hurricane count variance is proportional to the expected hurricane count [i.e., \( \mbox{var}(H) = r(1+3p) \propto r \propto E(H) \)]. You estimated the parameters of the annual count model using Poisson regression, which assumes that the variance of the count is, in fact, proportional to the expected count. Thus under the assumption that \( p \) is constant, Poisson regression can be used for estimating \( \lambda \) in the cluster model.
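The variance formula quoted above follows from the law of total variance. A short derivation, using only the model definitions \( H = N + M \), \( N \sim \mbox{Poisson}(r) \), and \( M \mid N \sim \mbox{Binomial}(N, p) \), is

\[
\begin{aligned}
E(H \mid N) &= N(1+p), \qquad \mbox{var}(H \mid N) = N p (1-p), \\
E(H) &= r(1+p) = \lambda, \\
\mbox{var}(H) &= E[\mbox{var}(H \mid N)] + \mbox{var}[E(H \mid N)] \\
 &= r p (1-p) + (1+p)^2 r = r(1+3p).
\end{aligned}
\]

The dispersion is therefore \( \mbox{var}(H)/E(H) = (1+3p)/(1+p) \), which equals one when \( p=0 \) and two when \( p=1 \).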

As before, you use a logarithmic link function to regress the cluster rate onto the predictors NAO and SOI. The parameters of this annual cluster count model cannot be estimated directly, since the observed hurricane count does not furnish information about the number of clusters.

Consider the observed set of annual Florida hurricane counts. Since the annual frequency is quite small, the majority of years have either no hurricanes or a single hurricane. You can create a reduced data set by using an indicator of whether or not there was at least one hurricane. Formally let \( I_i=I(H_i > 0)=I(N_i > 0) \); then \( I_i \) is an indicator of the occurrence of at least one hurricane cluster in year \( i \). You assume \( I_i \) has a binomial distribution with size parameter of one and a proportion equal to \( \pi \). This leads to a logistic regression model (see the chapter on frequency models in Elsner and Jagger (2013)) for \( I \).

Note that since \( \exp(-r) \) is the probability of no clusters, the probability of a cluster \( \pi \) is \( 1 - \exp(-r) \). Thus the cluster rate is \( r = -\log(1-\pi) \). If you use a logarithmic link function on \( r \), then \( \log(r) = \log(-\log(1-\pi)) = \mbox{cloglog}(\pi) \), where cloglog is the complementary log-log function. Thus you model \( I \) using the cloglog function to obtain \( r \).
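The link between the occurrence indicator and the cluster rate can be checked on simulated data. The sketch below generates unobserved cluster counts from a hypothetical rate, keeps only the indicator of at least one cluster, and fits a binomial GLM with a complementary log-log link. The covariate x and the coefficient values are assumptions made for illustration.

```r
# Sketch: recover the cluster rate from an occurrence indicator with a cloglog link.
# The covariate and coefficients are hypothetical; the data are simulated.
set.seed(7)
n = 5000
x = rnorm(n)                           # hypothetical covariate
r_true = exp(-0.4 - 0.3 * x)           # hypothetical annual cluster rate
N = rpois(n, lambda = r_true)          # cluster counts (not observed in practice)
ind = as.integer(N > 0)                # observed indicator of at least one cluster
fit = glm(ind ~ x, family = binomial(link = "cloglog"))
coef(fit)                              # on the log(r) scale
# Because cloglog(pi) = log(r), the fitted rate is exp of the linear predictor
r_hat = exp(predict(fit, type = "link"))
# The same rate follows from the fitted probability: r = -log(1 - pi)
all.equal(unname(r_hat), unname(-log(1 - fitted(fit))))
```

The coefficients from the cloglog fit are directly comparable with those from a Poisson regression with a log link, which is what makes the later comparison of the two models possible.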

Your cluster model is a combination of two models, one for the counts and another for the clusters.

You start by comparing fitted values from the count model with fitted values from the cluster model. Let \( H_i \) be the hurricane count in year \( i \) and \( \hat{\lambda}_i \) and \( \hat{r}_i \) be the fitted annual count and cluster rates, respectively. Then let \( \tau_0 \) be a test statistic given by
\( \tau_0=\frac{1}{n}\sum_{i=1}^n (H_i-\hat{r}_i) = \frac{1}{n}\sum_{i=1}^n (\hat{\lambda}_i-\hat{r}_i) \).
The second equality holds because a Poisson regression with an intercept and a logarithmic link reproduces the observed total count.

The value of \( \tau_0 \) is greater than zero if there is clustering. You test the significance of \( \tau_0 \) by generating random samples of length \( n \) from a Poisson distribution with rate \( \hat{\lambda}_i \) and computing \( \tau_j \) for \( j =1, \ldots, N \), where \( N \) is here the number of samples (not the number of clusters). The p-value for a test of the null hypothesis that \( \tau_0 \le 0 \) is the proportion of simulated \( \tau \)'s that are at least as large as \( \tau_0 \).

You do this with the testfits() function defined in the correlationfuns.R script by specifying the model formula, data, and number of random samples.

tfF = testfits(F.1 ~ nao + soi, data = dat, N = 100)
tfF$test

## [1] 0.1043

tfF$testpvalue

## [1] 0.02

For Florida hurricanes the test statistic \( \tau_0 \) has a value of 0.104, indicating some difference in count and cluster rates. The proportion of 100 simulated \( \tau \)'s that are at least as large as this is 0.02, providing sufficient evidence to reject the no-cluster hypothesis.

Repeating with Gulf coast hurricanes

tfG = testfits(G.1 ~ nao + soi, data = dat, N = 100)
tfG$test

## [1] -0.06174

tfG$testpvalue

## [1] 0.8

you find little evidence against the no-cluster hypothesis.
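The simulation test described above can be sketched without the correlationfuns.R script. The version below is a minimal illustration under stated assumptions: the counts and the single covariate x are simulated, the helper tau() is hypothetical, and it is not claimed that testfits() is implemented this way.

```r
# Sketch: Monte Carlo test of the statistic tau = mean(lambda_hat - r_hat).
# Data are simulated with no clustering; tau() is a hypothetical helper.
set.seed(11)
n = 145
x = rnorm(n)                                  # hypothetical covariate
h = rpois(n, lambda = exp(-0.3 - 0.2 * x))    # annual counts, no clustering

tau = function(h, x) {
  # fitted count rate from a Poisson regression
  lam = fitted(glm(h ~ x, family = poisson))
  # fitted cluster rate from a cloglog regression on the occurrence indicator
  pi_hat = fitted(glm(as.integer(h > 0) ~ x, family = binomial(link = "cloglog")))
  mean(lam - (-log(1 - pi_hat)))
}

tau0 = tau(h, x)
lam0 = fitted(glm(h ~ x, family = poisson))
nsim = 100
# counts simulated under the no-cluster (Poisson) hypothesis
taus = replicate(nsim, tau(rpois(n, lambda = lam0), x))
pval = mean(taus >= tau0)
c(tau0 = tau0, p.value = pval)
```

With only 100 samples the p-value is resolved to the nearest 0.01, so a larger number of samples is worth the extra computing time when the result is near a decision threshold.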

A linear regression through the origin of the fitted count rate on the cluster rate under the assumption that \( p \) is constant yields an estimate for \( 1+p \). You plot the annual count and cluster rates and draw the regression line using the plotfits() function.

par(mfrow = c(1, 2), pty = "s")
ptfF = plotfits(tfF)
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
ptfG = plotfits(tfG)
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

Figure: Count versus cluster rates for (a) Florida and (b) Gulf coast.

The black line is the \( y=x \) line and you expect cluster and hurricane rates to align along this line if there is no clustering. The red line is the regression of the fitted hurricane rate onto the fitted cluster rate with the intercept set to zero. The slope of the line is an estimate of \( 1+p \).

The regression slopes are printed by typing

coefficients(ptfF)

##  rate 
## 1.138

coefficients(ptfG)

##   rate 
## 0.9425

The slope is 1.14 for the Florida region giving 0.14 as an estimate for \( p \) (probability the cluster will have two hurricanes). The regression slope is 0.94 for the Gulf coast region which you interpret as a lack of evidence for hurricane clusters.
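The slope estimate can be illustrated on simulated data where \( p \) is known. The sketch below generates clustered counts with a hypothetical \( p = 0.14 \), fits the count and cluster models, and regresses the fitted count rate on the fitted cluster rate through the origin. All names and parameter values are assumptions. What plotfits() does internally is not shown in the source.

```r
# Sketch: estimate 1 + p as the slope of count rate on cluster rate (no intercept).
# Simulated data with a known, hypothetical p.
set.seed(3)
n = 100000
x = rnorm(n)                                  # hypothetical covariate
r_true = exp(-0.4 - 0.3 * x)                  # hypothetical cluster rate
p_true = 0.14
N = rpois(n, lambda = r_true)                 # clusters per year
h = N + rbinom(n, size = N, prob = p_true)    # hurricanes per year
lam = fitted(glm(h ~ x, family = poisson))    # fitted count rate
pi_hat = fitted(glm(as.integer(h > 0) ~ x, family = binomial(link = "cloglog")))
r = -log(1 - pi_hat)                          # fitted cluster rate
slope = unname(coef(lm(lam ~ 0 + r)))         # regression through the origin
p_hat = slope - 1
c(slope = slope, p_hat = p_hat)
```

A large simulated sample is used so that the estimate settles near the known value. With 145 years of data the slope is much less certain.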

Your focus is now on Florida hurricanes only. You continue by looking at the coefficients from both models. Type

summary(tfF$fits$poisson)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26820    0.10533  -2.546  0.01089
## nao         -0.22579    0.09023  -2.502  0.01234
## soi          0.05618    0.03026   1.856  0.06340

summary(tfF$fits$binomial)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.41950    0.13476 -3.1129 0.001852
## nao         -0.26861    0.12059 -2.2275 0.025917
## soi          0.02123    0.04059  0.5232 0.600840

The output coefficient tables show that the NAO is significant in both models at the 0.05 level. The SOI is marginally significant in the hurricane count model (p-value of 0.063) and not significant in the hurricane cluster model.

The difference in the fitted linear predictors from the two models is an estimate of \( \log(1+p) \), where again \( p \) is the probability that a cluster will have two hurricanes. The difference in the NAO coefficient is 0.043 and the difference in the SOI coefficient is 0.035, indicating that the estimated probability of clustering changes slightly more per unit of the NAO than per unit of the SOI. Because the NAO coefficient is more negative in the binomial model (-0.269) than in the Poisson model (-0.226), lower values of the NAO lead to a larger proportional rate increase for the binomial model relative to the Poisson model.

Forecasts

It is interesting to compare forecasts of the distribution of Florida hurricanes using a Poisson model and your cluster model. Here you set \( p=0.138 \) for the cluster model. You can use the same two-component formulation for your Poisson model by setting \( p=0 \).

You prepare your data using the lambdapclust() function as follows.

ctsF = cts[, "F.1", drop = FALSE]
pars = lambdapclust(prsp[, "F.1", drop = FALSE], p = 0.138)
ny = nrow(ctsF)
h = 0:5

Next you compute the expected number of years with h hurricanes from the cluster and Poisson models and tabulate the observed number of years. You combine them in a data object.

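# Expected number of years with h hurricanes under the cluster model
eCl = sapply(h, function(x) sum(do.call("dclust", c(x = list(rep(x, ny)), pars))))
# Expected number of years with h hurricanes under the Poisson model
ePo = sapply(h, function(x) sum(dpois(x = rep(x, ny), lambda = prsp[, "F.1"])))
# Observed number of years with h hurricanes (assumes every count from 0 to 5 occurs)
o = as.numeric(table(ctsF))
# Note: this overwrites the data frame dat with a 3 x 6 matrix
dat = rbind(o, eCl, ePo)
colnames(dat) = h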

Finally you plot the observed versus the expected from the cluster and Poisson models using a bar plot where the bars are plotted side-by-side.

barplot(dat, ylab = "Number of Years", xlab = "Number of Florida Hurricanes", 
    names.arg = c(0:5), col = c("black", "red", "blue"), legend = c("Observed", 
        "Cluster", "Poisson"), beside = TRUE)


The expected numbers are based on a cluster model (\( p \) = 0.138) and on a Poisson model (\( p \) = 0). The cluster model fits the observed counts better than does the Poisson model, particularly at the low and high count years.
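The probability of \( h \) hurricanes under the cluster model follows from the definitions of \( N \) and \( M \): a year with \( h \) hurricanes and \( m \) two-hurricane clusters has \( h-m \) clusters in total. The function below is a hypothetical stand-in written from that definition. It is not the dclust() function from correlationfuns.R, whose arguments are not shown in the source.

```r
# Sketch: probability mass function of the cluster model.
# dclust_sketch() is a hypothetical name; lambda is the hurricane rate r(1 + p).
dclust_sketch = function(h, lambda, p) {
  r = lambda/(1 + p)                   # cluster rate implied by lambda and p
  sapply(h, function(k) {
    m = 0:floor(k/2)                   # possible numbers of two-hurricane clusters
    # k - m clusters in total, of which m have a second hurricane
    sum(dpois(k - m, lambda = r) * dbinom(m, size = k - m, prob = p))
  })
}

probs = dclust_sketch(0:5, lambda = 0.7, p = 0.138)    # hypothetical rate
pois = dclust_sketch(0:5, lambda = 0.7, p = 0)         # reduces to dpois(0:5, 0.7)
rbind(cluster = probs, poisson = pois)
```

For the same hurricane rate, the cluster model places more probability on zero and on the larger counts than the Poisson model, which is the pattern seen in the bar plot.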

Florida had hurricanes in only two of the 11 years from 2000 through 2010. But these two years featured seven hurricanes. Seasonal forecast models that predict U.S. hurricane activity assume a Poisson distribution. You show here that this assumption applied to Florida hurricanes leads to a forecast that underpredicts both the number of years without hurricanes and the number of years with three or more hurricanes (Jagger and Elsner 2012).

The lack of fit in the forecast model arises from clustering of hurricanes along this part of the coast. You demonstrate a temporal cluster model that assumes the rate of hurricane clusters follows a Poisson distribution with the size of the cluster limited to two hurricanes. The model fits the distribution of Florida hurricanes better than a Poisson model when both are conditioned on the NAO and SOI.