Code in Support of RPI10-02-001
===============================

```{r date}
date()
setwd("~/Dropbox/Book/Chapter11")
```

Read the hurricane data
-----------------------
Begin by loading *annual.RData*.  These data were assembled in Chapter 5 of Elsner and Jagger (2013).  Subset the data for years starting with 1866.
```{r readAnnualData}
load("annual.RData")
dat = subset(annual, Year >= 1866)
```
The covariate Southern Oscillation Index (SOI) data begin in 1866.  Next, extract all hurricane counts for the Gulf coast, Florida, and East coast regions.
```{r extractGFEregionsData}
cts = dat[, c("G.1", "F.1", "E.1")]
```

Cluster detection
-----------------
Start by comparing the observed with the expected number of years for the two groups of hurricane counts.  The groups include years with no hurricanes and years with three or more.  The expected number is from a Poisson distribution with a constant rate.  

The idea is that for regions that show a cluster of hurricanes, the observed number of years with no hurricanes and years with three or more hurricanes should be greater than the corresponding expected number.  Said another way, in regions with hurricane clustering, a Poisson model with a hurricane rate estimated from counts over all years will underestimate both the number of years with no hurricanes and the number of years with many hurricanes.
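
The idea can be checked with a small simulation.  The sketch below does not use the chapter's data: the cluster rate, the two-hurricane probability, and the number of simulated years are all assumed values chosen only for illustration.

```r
# Illustrative simulation (assumed parameters, not the hurricane data)
set.seed(1)
n = 10000                            # many simulated "years" to reduce noise
r = 0.5                              # assumed annual cluster rate
p = 0.5                              # assumed probability of a two-hurricane cluster
N = rpois(n, lambda=r)               # clusters per year
H = N + rbinom(n, size=N, prob=p)    # hurricanes per year (one or two per cluster)

# Observed years in the three groups: 0, 1-2, and 3 or more hurricanes
obsSim = table(cut(H, breaks=c(-.5, .5, 2.5, Inf)))
# Expected years from a Poisson distribution with the same overall mean
expSim = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda=mean(H)))
rbind(observed=as.numeric(obsSim), expected=expSim)
```

With clustered counts the observed number of years exceeds the Poisson expectation in the first and last groups, which is the signature you look for in the regional data.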

For example, you find the observed number of years without a Florida hurricane and the number of years with more than two hurricanes by typing
```{r observedCountsFL}
obs = table(cut(cts$F.1, 
  breaks=c(-.5, .5, 2.5, Inf)))
obs
```
You obtain the expected numbers for these three groups by typing
```{r expectedRate}
n = length(cts$F.1)
mu = mean(cts$F.1)
exp = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda=mu))
exp
```

You use functions in the **correlationfuns.R** script to get the observed and expected counts for the three regions as a table.
```{r observedExpected}
source("correlationfuns.R")
mu = colMeans(cts)
rt = regionTable(cts, mu)
```

In the Gulf and East coast regions the observed number of years is relatively close to the expected number of years in each of the groups.  In contrast, in the Florida region the observed number of years exceeds the expected number of years for the no-hurricane and the three-or-more hurricanes groups.

The difference between the observed and expected numbers in each region is used to assess the statistical significance of the clustering.  This is done using Pearson residuals and a chi-squared statistic.  The Pearson residual is the difference between the observed count and expected rate divided by the square root of the variance.  The p-value is evidence in support of the null hypothesis of no clustering, which is indicated by no difference between the observed and expected numbers in each group.

For example, to obtain the chi-squared statistic, type
```{r chiSquaredStatistic}
xis = sum((obs - exp)^2 / exp)
xis
```
The p-value as evidence in support of the null hypothesis is given by
```{r chiSquaredTest}
pchisq(q=xis, df=2, lower.tail=FALSE)
```
where df is the degrees of freedom equal to the number of groups minus one.
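
As a cross-check, the same statistic and p-value can be obtained from the base R function `chisq.test()` by supplying the Poisson group probabilities.  The sketch below uses simulated counts rather than the Florida data, and the variable names are introduced here only for illustration.

```r
set.seed(2)
y = rpois(200, lambda=0.8)           # simulated annual counts (assumed rate)
nY = length(y)
obsY = as.numeric(table(cut(y, breaks=c(-.5, .5, 2.5, Inf))))
# Poisson probabilities for the groups 0, 1-2, and 3 or more
prbY = diff(ppois(c(-Inf, 0, 2, Inf), lambda=mean(y)))

# Manual chi-squared statistic and p-value, as in the text
xisY = sum((obsY - nY * prbY)^2 / (nY * prbY))
pY = pchisq(q=xisY, df=2, lower.tail=FALSE)

# Goodness-of-fit test with the same probabilities (df = groups - 1)
chk = chisq.test(obsY, p=prbY)
c(manual=pY, chisq.test=chk$p.value)
```

The two p-values agree because `chisq.test()` with a probability vector computes the same sum of squared differences and uses the number of groups minus one as the degrees of freedom.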

The p-values for the Gulf and East coasts are greater than .05 indicating little support for the cluster hypothesis.  In contrast the p-value for the Florida region is `r I(round(rt[2,8],3))` using the Pearson residuals and `r I(round(rt[2,10],3))` using the chi-squared statistic.  These values provide evidence the hurricane occurrences in Florida are grouped in time.

Conditional counts
------------------
What might be causing this grouping?  The extra variation in annual hurricane counts might be due to variation in hurricane rates.  You examine this possibility with a Poisson regression model.  The model includes an index for the North Atlantic Oscillation (NAO) and the Southern Oscillation (SOI) after Elsner and Jagger (2006).  This is a generalized linear model (GLM) approach using the Poisson family that includes a logarithmic link function for the rate. 

The code to fit the model, determine the expected count, and table the observed versus expected counts for each region is given by
```{r observedAndExpected}
pfts = list(
  G.1 = glm(G.1 ~ nao + soi, family="poisson", data=dat),
  F.1 = glm(F.1 ~ nao + soi, family="poisson", data=dat),
  E.1 = glm(E.1 ~ nao + soi, family="poisson", data=dat))
prsp = sapply(pfts, fitted, type="response")
rt = regionTable(cts, prsp, df=3)
rt
```

The count model gives an expected number of hurricanes each year.  The expected is compared to the observed number as before.  Results indicate that clustering is somewhat ameliorated by conditioning the rates on the covariates.  

In particular, the Pearson residual reduces to `r I(round(rt[2,7],1))` with an increase in the corresponding p-value to `r I(round(rt[2,8],3))`.  However, the p-value remains near .15 indicating the conditional model, while an improvement, fails to capture all the extra variation in Florida hurricane counts.
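
A related and commonly used check on extra variation in a Poisson regression is the sum of squared Pearson residuals compared against a chi-squared distribution on the residual degrees of freedom.  The sketch below illustrates that check on simulated data with assumed coefficients.  It is not necessarily the calculation performed inside `regionTable()`.

```r
set.seed(3)
nYr = 145                                  # assumed number of years
sim = data.frame(nao=rnorm(nYr), soi=rnorm(nYr))
# Simulated counts from a Poisson model with assumed coefficients
sim$y = rpois(nYr, lambda=exp(-0.4 - 0.3 * sim$nao + 0.2 * sim$soi))
fit = glm(y ~ nao + soi, family="poisson", data=sim)

# Pearson residual: (observed - fitted rate) / sqrt(fitted rate)
pr = residuals(fit, type="pearson")
X2 = sum(pr^2)
pchisq(q=X2, df=df.residual(fit), lower.tail=FALSE)
```

A small p-value here would point to more variation in the counts than the conditional Poisson model allows.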

A cluster model
---------------
Having found evidence that Florida hurricanes arrive in clusters, you model this process.  In the simplest case you assume the following.  Each cluster has either one or two hurricanes and the annual cluster counts follow a Poisson distribution with a rate $r$.  Note the difference.  Earlier you assumed each hurricane was independent and annual hurricane counts followed a Poisson distribution.  Further, you let $p$ be the probability that a cluster will have two hurricanes.

Formally your model can be expressed as follows. Let $N$ be the number of clusters in a given year and $X_i, i =1, \dots, N$ be the number of hurricanes in each cluster minus one.  Then the number of hurricanes in a given year is given by $H=N+\sum_{i=1}^N X_i$.  Conditional on $N$, $M=\sum_{i=1}^N X_i$ has a binomial distribution since the $X_i$'s are independent Bernoulli variables and $p$ is constant.  That is, $H=N+M$, where the annual number of clusters $N$ has a Poisson distribution with cluster rate $r$, and $M$ has a binomial distribution with proportion $p$ and size $N$.  Here the binomial distribution describes the number of two-hurricane clusters among the $N$ independent clusters in a year, with each cluster having a probability $p$ of containing a second hurricane.

The model has two parameters $r$ and $p$.  A better parameterization is to use $\lambda = r(1+p)$ with $p$ to separate the hurricane frequency from the cluster probability.  The parameters do not need to be fixed and can be functions of the covariates. When $p=0$, $H$ is Poisson, and when $p=1$, $H/2$ is Poisson, the dispersion is two, and the probability that $H$ is even is 1. You need a way to estimate $r$ and $p$.
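
These properties can be checked by simulating from the model.  In the sketch below `rclustSketch()` is a hypothetical helper written for this illustration (it is not part of **correlationfuns.R**), and the cluster rate is an assumed value.

```r
# rclustSketch() is a hypothetical helper, not part of correlationfuns.R
rclustSketch = function(n, r, p) {
  N = rpois(n, lambda=r)              # number of clusters each year
  N + rbinom(n, size=N, prob=p)       # add the second hurricane where present
}
set.seed(4)
r = 0.6                               # assumed cluster rate
H = rclustSketch(1e5, r, p=0.25)
c(simulated=mean(H), lambda=r * (1 + 0.25))

# Dispersion (variance over mean) at the two boundary cases
H0 = rclustSketch(1e5, r, p=0)
H1 = rclustSketch(1e5, r, p=1)
c(p0=var(H0) / mean(H0), p1=var(H1) / mean(H1))
all(H1 %% 2 == 0)                     # counts are always even when p = 1
```

The simulated mean should be close to $\lambda = r(1+p)$, and the dispersion should be near one when $p=0$ and near two when $p=1$.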

Parameter estimation
--------------------
Your goal is a hurricane count distribution for Florida.  For that you need an estimate of the annual cluster rate ($r$) and the probability ($p$) that the cluster size is two.  Continuing with the GLM approach you separately estimate the annual hurricane frequency, $\lambda$, and the annual cluster rate $r$.  The ratio of these two parameters minus one is an estimate of the probability $p$.

This is reasonable if $p$ does not vary much, since the annual hurricane count variance is proportional to the expected hurricane count [i.e., $\mbox{var}(H) = r(1+3p) \propto r \propto E(H)$].  You estimated the parameters of the annual count model using Poisson regression, which assumes that the variance of the count is, in fact, proportional to the expected count. Thus under the assumption that $p$ is constant, Poisson regression can be used for estimating $\lambda$ in the cluster model.
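
The variance expression follows from conditioning on the number of clusters $N$.  Using $H = N + M$ with $N$ Poisson with rate $r$ and $M$ binomial with size $N$ and proportion $p$, a short derivation is

$$
\begin{aligned}
E(H) &= E[E(H \mid N)] = E[N(1+p)] = r(1+p) = \lambda, \\
\mbox{var}(H) &= E[\mbox{var}(H \mid N)] + \mbox{var}[E(H \mid N)] \\
 &= E[N p (1-p)] + \mbox{var}[N(1+p)] \\
 &= r p (1-p) + r (1+p)^2 = r(1+3p).
\end{aligned}
$$

The ratio $\mbox{var}(H)/E(H) = (1+3p)/(1+p)$ depends only on $p$, so a constant $p$ gives a variance proportional to the mean.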

As before, you use a logarithmic link function to regress the cluster rate onto the predictors NAO and SOI. The parameters of this annual cluster count model cannot be estimated directly, since the observed hurricane count does not furnish information about the number of clusters.

Consider the observed set of annual Florida hurricane counts.  Since the annual frequency is quite small, the majority of years have either no hurricanes or a single hurricane.  You can create a 'reduced' data set by using an indicator of whether or not there was at least one hurricane.  Formally let $I_i=I(H_i > 0)=I(N_i > 0)$, then $I$ is an indicator of the occurrence of a hurricane cluster for each year.  You assume $I$ has a binomial distribution with size parameter of one and a proportion equal to $\pi$.  This leads to a logistic regression model (see the chapter on frequency models) for $I$.

Note that since $\exp(-r)$ is the probability of no clusters, the probability of a cluster $\pi$ is $1 - \exp(-r)$.  Thus the cluster rate is $r = -\log(1-\pi)$.  If you use a logarithmic link function on $r$, then $\log(r) = \log(-\log(1-\pi)) = \mbox{cloglog}(\pi)$, where cloglog is the complementary log-log function.  Thus you model $I$ using the cloglog function to obtain $r$.
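
The link between $\pi$ and $r$ can be sketched with a binomial GLM that uses the cloglog link.  The data below are simulated with assumed coefficients, and the object names are introduced only for this illustration.

```r
set.seed(5)
nYr = 2000                                      # assumed number of simulated years
sim = data.frame(nao=rnorm(nYr), soi=rnorm(nYr))
# Assumed cluster rate as a log-linear function of the covariates
rTrue = exp(-0.7 - 0.3 * sim$nao + 0.2 * sim$soi)
# Indicator of at least one cluster in the year
sim$I = as.numeric(rpois(nYr, lambda=rTrue) > 0)

bfit = glm(I ~ nao + soi, family=binomial(link="cloglog"), data=sim)
piHat = fitted(bfit)                            # fitted probability of a cluster
rHat = -log(1 - piHat)                          # implied cluster rate
# log(rHat) equals the linear predictor of the cloglog model
all.equal(log(rHat), predict(bfit, type="link"), check.attributes=FALSE)
coef(bfit)
```

Because the linear predictor is $\log(r)$, the coefficients from this model are directly comparable to those from a Poisson regression with a log link.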

Your cluster model is a combination of two models, one for the counts and another for the clusters.

You start by comparing fitted values from the count model with fitted values from the cluster model.  Let $H_i$ be the hurricane count in year $i$ and $\hat \lambda_i$ and $\hat r_i$ be the fitted annual count and cluster rates, respectively.  Then let $\tau_0$ be a test statistic given by
$\tau_0=\frac{1}{n}\sum_{i=1}^n (H_i-\hat{r}_i) = \frac{1}{n}\sum_{i=1}^n (\hat{\lambda}_i-\hat{r}_i)$.  The second equality holds because a Poisson regression with a log link and an intercept reproduces the observed mean count.

The value of $\tau_0$ is greater than zero if there is clustering.  You test the significance of $\tau_0$ by generating random samples of length $n$ from a Poisson distribution with rate $\hat \lambda_i$ and computing $\tau_j$ for $j =1, \ldots, N$, where $N$ here denotes the number of samples (not the number of clusters).  The p-value for a test of the null hypothesis that $\tau_0 \le 0$ is the proportion of simulated $\tau$'s that are at least as large as $\tau_0$.
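
The test can be sketched as follows.  Here `tauStat()` is a hypothetical helper written for this illustration, the data are simulated with assumed parameters, and the details may differ from what the function supplied with the chapter does.

```r
# tauStat() is a hypothetical helper: mean fitted count rate minus
# mean fitted cluster rate for a vector of annual counts y
tauStat = function(y, data) {
  data$y = y
  data$I = as.numeric(y > 0)
  lamHat = fitted(glm(y ~ nao + soi, family="poisson", data=data))
  piHat = fitted(glm(I ~ nao + soi, family=binomial(link="cloglog"),
    data=data))
  rHat = -log(1 - piHat)
  mean(lamHat - rHat)
}

set.seed(6)
nYr = 145                                       # assumed number of years
sim = data.frame(nao=rnorm(nYr), soi=rnorm(nYr))
rTrue = exp(-0.7 - 0.3 * sim$nao + 0.2 * sim$soi)
Ncl = rpois(nYr, lambda=rTrue)                  # simulated cluster counts
y = Ncl + rbinom(nYr, size=Ncl, prob=0.4)       # simulated clustered counts

tau0 = tauStat(y, sim)
# Null samples: independent Poisson counts with the fitted count rates
lamHat = fitted(glm(y ~ nao + soi, family="poisson",
  data=transform(sim, y=y)))
tauSim = replicate(100, tauStat(rpois(nYr, lambda=lamHat), sim))
c(tau0=tau0, pvalue=mean(tauSim >= tau0))
```

With only 100 samples the p-value is resolved to the nearest .01, so a larger number of samples is advisable when the result is near a decision threshold.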

You do this with the testfits() function in the **correlationfuns.R** script by specifying the model formula, data, and number of random samples.
```{r testFitsFL}
tfF = testfits(F.1 ~ nao + soi, data=dat, N=100)
tfF$test; tfF$testpvalue
```
For Florida hurricanes the test statistic $\tau_0$ has a value of `r I(round(tfF$test,3))` indicating some difference in count and cluster rates.  The proportion of 100 simulated $\tau$'s that are at least as large as this is `r I(round(tfF$testpvalue,3))` providing sufficient evidence to reject the no-cluster hypothesis.  

Repeating with Gulf Coast hurricanes
```{r testFitsG}
tfG = testfits(G.1 ~ nao + soi, data=dat, N=100)
tfG$test; tfG$testpvalue
```
you find little evidence against the no-cluster hypothesis.

A linear regression through the origin of the fitted count rate on the cluster rate under the assumption that $p$ is constant yields an estimate for $1+p$.  You plot the annual count and cluster rates and draw the regression line using the `plotfits()` function.
```{r plotCountVsClusterRates}
par(mfrow=c(1, 2), pty="s")
ptfF = plotfits(tfF)
mtext("a", side=3, line=1, adj=0, cex=1.1)
ptfG = plotfits(tfG)
mtext("b", side=3, line=1, adj=0, cex=1.1)
```
**Count versus cluster rates for (a) Florida and (b) Gulf coast.**

The black line is the $y=x$ line and you expect cluster and hurricane rates to align along this axis if there is no clustering.  The red line is the regression of the fitted hurricane rate onto the fitted cluster rate with the intercept set to zero.  The slope of the line is an estimate of $1+p$.

The regression slopes are printed by typing
```{r slopes}
coefficients(ptfF); coefficients(ptfG)
```
The slope is `r I(round(as.numeric(coefficients(ptfF)),2))` for the Florida region giving `r I(round(as.numeric(coefficients(ptfF))-1,2))` as an estimate for $p$ (probability the cluster will have two hurricanes).  The regression slope is `r I(round(as.numeric(coefficients(ptfG)),2))` for the Gulf coast region which you interpret as a lack of evidence for hurricane clusters.
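
The slope calculation itself is a regression through the origin.  The sketch below uses stand-in rates generated with an assumed value of $p$ rather than the fitted rates from the chapter, so the numbers are illustrative only.

```r
set.seed(7)
pTrue = 0.15                                    # assumed two-hurricane probability
rHat = runif(145, min=0.2, max=1.0)             # stand-in fitted cluster rates
# Stand-in fitted count rates: r(1 + p) with a little multiplicative noise
lamHat = rHat * (1 + pTrue) * exp(rnorm(145, sd=0.02))

# Regression through the origin: the -1 drops the intercept
slope = coef(lm(lamHat ~ rHat - 1))
c(slope=unname(slope), pEstimate=unname(slope) - 1)
```

Subtracting one from the slope recovers a value close to the assumed $p$.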

Your focus is now on Florida hurricanes only.  You continue by looking at the coefficients from both models.  Type
```{r modelCoefficients}
summary(tfF$fits$poisson)$coef
summary(tfF$fits$binomial)$coef
```

The output coefficient tables show that the NAO and SOI covariates are significant in the hurricane count model, but only the NAO is significant in the hurricane cluster model.

The difference in coefficient values from the two models is an estimate of $\log(1+p)$, where again $p$ is the probability that a cluster will have two hurricanes.  The difference in the NAO coefficient is `r I(round(summary(tfF$fits$poisson)$coef[2,1]-summary(tfF$fits$binomial)$coef[2,1],3))` and the difference in the SOI coefficient is `r I(round(summary(tfF$fits$poisson)$coef[3,1]-summary(tfF$fits$binomial)$coef[3,1],3))` indicating the NAO increases the probability of clustering more than ENSO.  Lower values of the NAO lead to a larger rate increase for the Poisson model relative to the binomial model.

Forecasts
---------
It is interesting to compare forecasts of the distribution of Florida hurricanes using a Poisson model and your cluster model.  Here you set $p=.138$ for the cluster model.  You can use the same two-component formulation for your Poisson model by setting $p=0$.

You prepare your data using the `lambdapclust()` function as follows.
```{r prepareData}
ctsF = cts[, "F.1", drop=FALSE]
pars = lambdapclust(prsp[, "F.1", drop=FALSE], 
  p=.138)
ny = nrow(ctsF)
h = 0:5
```
Next you compute the expected number of years with $h$ hurricanes from the cluster and Poisson models and tabulate the observed number of years.  You combine them in a data object.
```{r expectedAndObserved}
eCl = sapply(h, function(x)
   sum(do.call('dclust',
   c(x=list(rep(x, ny)), pars))))
ePo = sapply(0:5, function(x)
   sum(dpois(x=rep(x,ny),
   lambda=prsp[, "F.1"])))
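# Tabulate over all values of h so counts with no years appear as zeros
o = as.numeric(table(factor(ctsF$F.1, levels=h)))
# Note: this replaces the annual data frame dat with a 3 x 6 matrix
dat = rbind(o, eCl, ePo)
colnames(dat) = h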
```
Finally you plot the observed versus the expected from the cluster and Poisson models using a bar plot where the bars are plotted side-by-side.
```{r sideBysidBarplot}
barplot(dat, ylab="Number of Years",
   xlab="Number of Florida Hurricanes",
   names.arg=c(0:5),
   col=c("black", "red", "blue"),
   legend=c("Observed", "Cluster", "Poisson"),
   beside=TRUE)
```

The expected numbers are based on a cluster model ($p$ = 0.138) and on a Poisson model ($p$ = 0).  The cluster model fits the observed counts better than does the Poisson model, particularly for the low and high count years.
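
The probability of $h$ hurricanes under the cluster model can be written as a sum over the feasible number of clusters.  The function `dclustSketch()` below is a hypothetical stand-in written from the model definition; the function of similar name used above may have a different interface.  The annual rate is an assumed value.

```r
# dclustSketch() is a hypothetical stand-in written from the model definition
dclustSketch = function(h, lambda, p) {
  r = lambda / (1 + p)                  # cluster rate from the hurricane rate
  n = ceiling(h / 2):h                  # feasible cluster counts for h hurricanes
  # h hurricanes from n clusters requires h - n two-hurricane clusters
  sum(dpois(n, lambda=r) * dbinom(h - n, size=n, prob=p))
}
lam = 0.75                              # assumed annual hurricane rate
p = 0.138
h = 0:5
pCl = sapply(h, dclustSketch, lambda=lam, p=p)
pPo = dpois(h, lambda=lam)
round(rbind(cluster=pCl, poisson=pPo), 4)
```

For the same annual rate the cluster model puts more probability on a year with no hurricanes than the Poisson model does, and setting $p=0$ recovers the Poisson probabilities.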

Florida had hurricanes in only two of the 11 years from 2000 through 2010.  But these two years featured seven hurricanes. Seasonal forecast models that predict U.S. hurricane activity assume a Poisson distribution.  You show here that this assumption applied to Florida hurricanes leads to a forecast that underpredicts both the number of years without hurricanes and the number of years with three or more hurricanes (Jagger and Elsner 2012).

The lack of fit in the forecast model arises from clustering of hurricanes along this part of the coast.  You demonstrate a temporal cluster model that assumes the annual number of hurricane clusters follows a Poisson distribution with the size of each cluster limited to two hurricanes.  The model fits the distribution of Florida hurricanes better than a Poisson model when both are conditioned on the NAO and SOI.