Chapter 11: Cluster Models

“There are in fact two things, science and opinion; the former begets knowledge, the latter, ignorance.”—Hippocrates

data files used: read.table(“gridboxes.txt”); read.csv(“bi.csv”, T); load(“annual.RData”); load(“best.use.RData”)
required packages: maps, spatstat, sp, rgdal, maptools, cluster, xtable, RColorBrewer
source code: source(“correlationfuns.R”)

Read the hurricane data

Begin by loading annual.RData. These data were assembled in Chapter 5 of Elsner and Jagger (2013). Subset the data for years starting with 1866.

load("annual.RData")
dat = subset(annual, Year >= 1866)

The covariate Southern Oscillation Index (SOI) data begin in 1866. Next, extract all hurricane counts for the Gulf coast, Florida, and East coast regions.

cts = dat[, c("G.1", "F.1", "E.1")]

Cluster detection

Start by comparing the observed with the expected number of years for the two groups of hurricane counts. The groups include years with no hurricanes and years with three or more. The expected number is from a Poisson distribution with a constant rate.

The idea is that for regions that show a cluster of hurricanes, the observed number of years with no hurricanes and years with three or more hurricanes should be greater than the corresponding expected number. Said another way, a Poisson model with a hurricane rate estimated from counts over all years in regions with hurricane clustering will under estimate the number of years with no hurricanes and years with many hurricanes.

For example, you find the observed number of years without a Florida hurricane and the number of years with more than two hurricanes by typing

obs = table(cut(cts$F.1, breaks = c(-0.5, 0.5, 2.5, Inf)))
obs

## 
## (-0.5,0.5]  (0.5,2.5]  (2.5,Inf] 
##         70         62         13

And the expected numbers for these three groups by typing

n = length(cts$F.1)
mu = mean(cts$F.1)
exp = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda = mu))
exp

## [1] 61.655 75.271  8.074

You use functions in the correlationfuns.R package to get the observed and expected counts for the three regions as a table.

source("correlationfuns.R")
mu = colMeans(cts)
rt = regionTable(cts, mu)

In the Gulf and East coast regions the observed number of years are relatively close to the expected number of years in each of the groups. In contrast, in the Florida region you see the observed number of years exceeds the expected number of years for the no-hurricane and the three-or-more hurricanes groups.

The difference between the observed and expected numbers in each region is used to assess the statistical significance of the clustering. This is done using Pearson residuals and chi-squared statistic. The Pearson residual is the difference between the observed count and expected rate divided by the square root of the variance. The p-value is evidence in support of the null hypothesis of no clustering as indicated by no difference between the observed and expected numbers in each group.

For example, to obtain the chi-squared statistic, type

xis = sum((obs - exp)^2/exp)
xis

## [1] 6.475

The p-value as evidence in support of the null hypothesis is given by

pchisq(q = xis, df = 2, lower.tail = FALSE)

## [1] 0.03926

where df is the degrees of freedom equal to the number of groups minus one.

The p-values for the Gulf and East coasts are greater than .05 indicating little support for the cluster hypothesis. In contrast the p-value for the Florida region is 0.009 using the Pearson residuals and 0.047 using the chi-squared statistic. These values provide evidence the hurricane occurrences in Florida are grouped in time.

Conditional counts

What might be causing this grouping? The extra variation in annual hurricane counts might be due to variation in hurricane rates. You examine this possibility with a Poisson regression model. The model includes an index for the North Atlantic Oscillation (NAO) and the Southern Oscillation (SOI) after Elsner and Jagger (2006). This is a generalized linear model (GLM) approach using the Poisson family that includes a logarithmic link function for the rate.

The code to fit the model, determine the expected count, and table the observed versus expected counts for each region is given by

pfts = list(G.1 = glm(G.1 ~ nao + soi, family = "poisson", data = dat), F.1 = glm(F.1 ~ 
    nao + soi, family = "poisson", data = dat), E.1 = glm(E.1 ~ nao + soi, family = "poisson", 
    data = dat))
prsp = sapply(pfts, fitted, type = "response")
rt = regionTable(cts, prsp, df = 3)
rt

##     Expected <= 0 Observed <= 0 1 <= Expected <= 2 1 <= Observed <= 2
## G.1         67.68            63              69.96                 75
## F.1         63.66            70              72.31                 62
## E.1         72.43            74              67.60                 64
##     Expected >= 3 Observed >= 3 Pearson.test Pearson.pvalue
## G.1         7.359             7        121.1        0.89736
## F.1         9.031            13        172.4        0.04214
## E.1         4.970             7        148.5        0.33771
##     statistic.X-squared p.value
## G.1              0.7043  0.7011
## F.1              3.8479  0.1419
## E.1              1.0554  0.5872

The count model gives an expected number of hurricanes each year. The expected is compared to the observed number as before. Results indicate that clustering is somewhat ameliorated by conditioning the rates on the covariates.

In particular, the Pearson residual reduces to 172.4 with an increase in the corresponding p-value to 0.042. However, the p-value remains near .15 indicating the conditional model, while an improvement, fails to capture all the extra variation in Florida hurricane counts.

A cluster model

Having found evidence that Florida hurricanes arrive in clusters, you model this process. In the simplest case you assume the following. Each cluster has either one or two hurricanes and the annual cluster counts follows a Poisson distribution with a rate \( r \). Note the difference. Earlier you assumed each hurricane was independent and annual hurricane counts followed a Poisson distribution. Further, you let \( p \) be the probability that a cluster will have two hurricanes.

Formally your model can be expressed as follows. Let \( N \) be the number of clusters in a given year and \( X_i, i =1, \dots, N \) be the number of hurricanes in each cluster minus one. Then the number of hurricanes in a given year is given by \( H=N+\sum{i=1}^N X_i \). Conditional on \( N \), \( M=\sum_{i=1}^N X_i \) has a binomial distribution since the \( X_i \)'s are independent Bernoulli variables and \( p \) is constant. That is, \( H=N+M \), where the annual number of clusters \( N \) has a Poisson distribution with cluster rate \( r \), and \( M \) has a binomial distribution with proportion \( p \) and size \( N \). Here the binomial distribution describes the number of occurrences of at least one hurricane in a sequence of \( N \) independent years, with each year having a probability \( p \) of observing at least one hurricane.

The model has two parameters \( r \) and \( p \). A better parameterization is to use \( \lambda = r(1+p) \) with \( p \) to separate the hurricane frequency from the cluster probability. The parameters do not need to be fixed and can be functions of the covariates. When \( p=0 \), \( H \) is Poisson, and when \( p=1 \), \( H/2 \) is Poisson, the dispersion is two, and the probability that \( H \) is even is 1. You need a way to estimate \( r \) and \( p \).

Parameter estimation

Your goal is a hurricane count distribution for Florida. For that you need an estimate of the annual cluster rate (\( r \)) and the probability (\( p \)) that the cluster size is two. Continuing with the GLM approach you separately estimate the annual hurricane frequency, \( \lambda \), and the annual cluster rate \( r \). The ratio of these two parameters minus one is an estimate of the probability \( p \).

This is reasonable if \( p \) does not vary much, since the annual hurricane count variance is proportional to the expected hurricane count [i.e., \( \mbox{var}(H) = r(1+3p) \propto r \propto E(H) \)]. You estimated the parameters of the annual count model using Poisson regression, which assumes that the variance of the count is, in fact, proportional to the expected count. Thus under the assumption that \( p \) is constant, Poisson regression can be used for estimating \( \lambda \) in the cluster model.

As before, you regress the logarithm of the link function for the cluster rate onto the predictors NAO and SOI. The parameters of this annual cluster count model cannot be estimated directly, since the observed hurricane count does not furnish information about the number of clusters.

Consider the observed set of annual Florida hurricane counts. Since the annual frequency is quite small, the majority of years have either no hurricanes or a single hurricane. You can create a reduced' data set by using an indictor of whether or not there was at least one hurricane. Formally let \( I_i=I(H_i > 0)=I(N_i > 0)) \), then \( I \) is an indicator of the occurrence of a hurricane cluster for each year. You assume \( I \) has a binomial distribution with size parameter of one and a proportion equal to \( \pi \). This leads to a logistic regression model (see Chapter~\ref{chap:frequencymodels}) for \( I \).

Note that since \( \exp(-r) \) is the probability of no clusters, the probability of a cluster \( \pi \) is \( 1 - \exp(-r) \). Thus the cluster rate is \( r = -\log(1-\pi) \). If you use a logarithmic link function on \( r \), then \( \log(r) = \log(-\log(1-\pi)) = \mbox{cloglog}(\pi) \), where cloglog is the complementary log-log function. Thus you model \( I \) using the cloglog function to obtain \( r \).

Your cluster model is a combination of two models, one for the counts another for the clusters.

You start by comparing fitted values from the count model with fitted values from the cluster model. Let \( H_i \) be the hurricane count in year \( i \) and \( \hat \lambda_i \) and \( \hat r_i \) be the fitted annual count and cluster rates, respectively. Then let \( \tau_0 \) be a test statistic given by \( \tau_0=\frac{1}{n}\sum_{i=1}^n (H_i-\hat{r_i}) = \frac{1}{n}\sum_{i=1}^n (\hat{\lambda}_i-\hat{r_i}) \).

The value of \( \tau_0 \) is greater than one if there is clustering. You test the significance of \( \tau_0 \) by generating random samples of length \( n \) from a Poisson distribution with rate \( \lambda_i \) and computing \( \tau_j \) for \( j =1, \ldots, N \), where \( N \) is the number of samples. A test of the null hypothesis that \( \tau_0 \le 0 \) is the proportion of simulated \( \tau \)'s that are at least as large as \( \tau_0 \).

You do this with the testfits() function in the correlationfuns.R package by specifying the model formula, data, and number of random samples.

tfF = testfits(F.1 ~ nao + soi, data = dat, N = 1000)
tfF$test

## [1] 0.1043

tfF$testpvalue

## [1] 0.036

For Florida hurricanes the test statistic \( \tau_0 \) has a value of 0.104 indicating some difference in count and cluster rates. The proportion of 1000 simulated \( \tau \)'s that are as least as large as this is 0.036 providing sufficient evidence to reject the no-cluster hypothesis. Repeating the simulation using Gulf Coast hurricanes

tfG = testfits(G.1 ~ nao + soi, data = dat, N = 1000)
tfG$test

## [1] -0.06174

tfG$testpvalue

## [1] 0.772

you find that, in contrast to Florida, there is little evidence against the no-cluster hypothesis.

A linear regression through the origin of the fitted count rate on the cluster rate under the assumption that \( p \) is constant yields an estimate for \( 1+p \). You plot the annual count and cluster rates and draw the regression line using the plotfits() function.

par(mfrow = c(1, 2), pty = "s")
ptfF = plotfits(tfF)
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
ptfG = plotfits(tfG)
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

plot of chunk plotCountVsClusterRates

The black line is the \( y=x \) line and you expect cluster and hurricane rates to align along this axis if there is no clustering. The red line is the regression of the fitted hurricane rate onto the fitted cluster rate with the intercept set to zero. The slope of the line is an estimate of \( 1+p \).

The regression slopes are printed by typing,

coefficients(ptfF)

##  rate 
## 1.138

coefficients(ptfG)

##   rate 
## 0.9425

The slope is 1.14 for the Florida region giving 0.14 as an estimate for \( p \) (probability the cluster will have two hurricanes). The regression slope is 0.94 for the Gulf coast region which you interpret as a lack of evidence for hurricane clusters.

Your focus is now on Florida hurricanes only. You continue by looking at the coefficients from both models. Type

summary(tfF$fits$poisson)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26820    0.10533  -2.546  0.01089
## nao         -0.22579    0.09023  -2.502  0.01234
## soi          0.05618    0.03026   1.856  0.06340

summary(tfF$fits$binomial)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.41950    0.13476 -3.1129 0.001852
## nao         -0.26861    0.12059 -2.2275 0.025917
## soi          0.02123    0.04059  0.5232 0.600840

The output coefficient tables show that the NAO and SOI covariates are significant in the hurricane count model, but only the NAO is significant in the hurricane cluster model.

The difference in coefficient values from the two models is an estimate of \( \log(1+p) \), where again \( p \) is the probability that a cluster will have two hurricanes.

The difference in the NAO coefficient is 0.043 and the difference in the SOI coefficient is 0.035 indicating the NAO increases the probability of clustering more than ENSO. Lower values of the NAO lead to a larger rate increase for the Poisson model relative to the binomial model.

Forecasts

It is interesting to compare forecasts of the distribution of Florida hurricanes using a Poisson model and your cluster model. Here you set \( p=.138 \) for the cluster model. You can use the same two-component formulation for your Poisson model by setting \( p=0 \).

You prepare your data using the lambdapclust() function as follows.

ctsF = cts[, "F.1", drop = FALSE]
pars = lambdapclust(prsp[, "F.1", drop = FALSE], p = 0.138)
ny = nrow(ctsF)
h = 0:5

Next you compute the expected number of years with h hurricanes from the cluster and Poisson models and tabulate the observed number of years. You combine them in a data object.

eCl = sapply(h, function(x) sum(do.call("dclust", c(x = list(rep(x, ny)), pars))))
ePo = sapply(0:5, function(x) sum(dpois(x = rep(x, ny), lambda = prsp[, "F.1"])))
o = as.numeric(table(ctsF))
dat = rbind(o, eCl, ePo)
names(dat) = 0:5

Finally you plot the observed versus the expected from the cluster and Poisson models using a bar plot where the bars are plotted side-by-side.

barplot(dat, ylab = "Number of Years", xlab = "Number of Florida Hurricanes", 
    names.arg = c(0:5), col = c("black", "red", "blue"), legend = c("Observed", 
        "Cluster", "Poisson"), beside = TRUE)

plot of chunk sideBySideBoxplot

The expected numbers are based on a cluster model (\( p \) = 0.137) and on a Poisson model (\( p \) = 0). The cluster model fits the observed counts better than does the Poisson model particularly at the low and high count years.

Florida had hurricanes in only two of the 11 years from 2000 through 2010. But these two years featured seven hurricanes. Seasonal forecast models that predict U.S. hurricane activity assume a Poisson distribution.

You show here this assumption applied to Florida hurricanes leads to a forecast that under predicts both the number of years without hurricanes and the number of years with three or more hurricanes (Jagger and Elsner 2012).

The lack of fit in the forecast model arises due to clustering of hurricanes along this part of the coast. You demonstrate a temporal cluster model that assumes the rate of hurricane clusters follows a Poisson distribution with the size of the cluster limited to two hurricanes. The model fits the distribution of Florida hurricanes better than a Poisson model when both are conditioned on the NAO and SOI.

Spatial Clusters

Is there a tendency for hurricanes to cluster in space? More specifically, given that a hurricane originates at a particular location is it more (or less) likely that another hurricanes will form in the same vicinity? This is a problem for spatial point pattern analysis. Here you consider models for analyzing and modeling events across space.

An event is the occurrence of some phenomenon of interest. For example, an event can be a hurricane's origin or its lifetime maximum intensity.

An event location is the spatial location of the event. For example, the latitude and longitude of the maximum intensity event.

A point is any location where an event could occur. A spatial point pattern is a collection of events and event locations together with spatial domain.

The spatial domain is defined as the region of interest over which events tend to occur. For example, the North Atlantic basin is the spatial domain for hurricanes.

To define spatial clustering, it helps to define its absence. A spatial point pattern is defined as random if an event is equally likely to occur at any point within the spatial domain.

A spatial point pattern is said to be clustered if given an event at some location it is more likely than random that another event will occur nearby. Regularity is the opposite; given an event, if it is less likely than random that another event will occur nearby, then the spatial point pattern is regular.

Said another way, complete spatial randomness (CSR) defines a situation where an event is equally likely to occur at any location within the study area regardless of the locations of other events.

A realization is a collection of spatial point patterns generated under a spatial point process model. To illustrate, consider four point patterns each consisting of events inside the unit plane. You generate the event locations and plot them by typing

par(mfrow = c(2, 2), mex = 0.9, pty = "s")
for (i in 1:4) {
    u = runif(n = 30, min = 0, max = 1)
    v = runif(n = 30, min = 0, max = 1)
    plot(u, v, pch = 19, xlim = c(0, 1), ylim = c(0, 1))
}

plot of chunk csrExamples

The patterns illustrate that some amount of clustering occurs by chance.

The spatstat package (Baddeley and Turner 2005) contains a large number of functions for analyzing and modeling point pattern data. To make the functions available and obtain a citation type

require(spatstat)
citation(package = "spatstat")

## 
## To cite 'spatstat' in publications, please use:
## 
##   A. Baddeley and R. Turner (2005). Spatstat: an R package for
##   analyzing spatial point patterns. Journal of Statistical
##   Software 12 (6), 1-42. ISSN: 1548-7660. URL: www.jstatsoft.org
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Spatstat: an {R} package for analyzing spatial point patterns},
##     author = {Adrian Baddeley and Rolf Turner},
##     journal = {Journal of Statistical Software},
##     volume = {12},
##     number = {6},
##     pages = {1--42},
##     year = {2005},
##     note = {{ISSN} 1548-7660},
##     url = {www.jstatsoft.org},
##   }
## 
## For references to additional papers describing the use of
## 'spatstat' see 'help("spatstat").'

Complete spatial randomness lies on a continuum between clustered and regular spatial point patterns. You use simulation functions to compare the CSR plots with regular and cluster patterns.

par(mfrow = c(2, 2), mex = 0.5)
plot(rMaternI(kappa = 80, r = 0.05), pch = 19, main = "")
plot(rMaternI(kappa = 80, r = 0.05), pch = 19, main = "")
plot(rMatClust(kappa = 20, r = 0.15, mu = 4), pch = 19, main = "")
plot(rMatClust(kappa = 20, r = 0.15, mu = 4), pch = 19, main = "")

plot of chunk clusterExamples

Here you can see two realizations of patterns more regular than CSR (top row), and two of patterns more clustered than CSR (bottom row). The simulations are made using a spatial point pattern model. Spatial scale plays a role. A set of events can indicate a regular spatial pattern on one scale but a clustered pattern on anther.

Point processes

Given a set of spatial events (spatial point pattern) your goal is to assess evidence for clustering (or regularity). Spatial cluster detection methods are based on statistical models for spatial point processes.

The random variable in these models is the event locations. Here we provide some definitions useful for understanding point process models. Details on the underlying theory are available in Ripley (1981), Cressie (1993), and Diggle (2003).

A process generating events is stationary when it is invariant to translation across the spatial domain. This means that the relationship between two events depends only on their positions relative to one another and not on the event locations themselves.

This relative position refers to (lag) distance and orientation between the events. A process is isotropic when orientation does not matter.

Given a single realization, the assumptions of stationarity and isotropy allow for replication. Two pairs of events in the same realization of a stationary process that are separated by the same distance should have the same relatedness.

This allows you to use relatedness information from one part of the domain as a replicate for relatedness from another part of the domain. The assumptions of stationarity and isotropy let you begin and can be relaxed later.

The Poisson distribution is used to define a model for CSR. A spatial point process is said to be homogeneous Poisson under the following two criteria: (1) The number of events occurring within a finite region \( A \) is a random variable following a Poisson distribution with mean \( \lambda \times A \) for some positive constant \( \lambda \), with \( |A| \) denoting the area of \( A \) and (2) Given the total number of events \( N \), their locations are an independent random sample of points where each point is equally likely to be picked as an event location.

The first criteria is about the density of the spatial process. For a given domain it answers the question, how many events? It is the number of events divided by the domain area. The density is an estimate of the rate parameter of the Poisson distribution.

Note that in spatial statistics this is often called the intensity. Here we use the term density instead so as not to confuse the spatial rate parameter with hurricane strength.

The second criteria is about homogeneity. Events are scattered throughout the domain and are neither clustered or regular.

It helps to consider how to create a homgeneous Poisson point pattern. The procedure follows straight from the definition. Step 1: Generate the number of events using a Poisson distribution with mean equal to \( \lambda \). Step 2: Place the events inside the domain using a uniform distribution for both spatial coordinates.

For example, let area of the domain be one and the density of events be 200, then you type

lambda = 200
N = rpois(1, lambda)
x = runif(N)
y = runif(N)
plot(x, y, pch = 19)

plot of chunk poissonCount

Note that if your domain is not a rectangle, you can circumscribe it within a rectangle, then reject events sampled from inside the rectangle that fall outside the domain.

By definition these point patterns are CSR. However, as noted above you are typically in the opposite position. You observe a set of events and want to know if they are regular or clustered.

Your null hypothesis is CSR and you need a test statistic that will guide your inference. The above demonstrates the null models are easy to construct, so you can use Monte Carlo methods.

In some cases the homogeneous Poisson model is too restrictive. Consider hurricane genesis as an event process. Event locations may be random but the probability of an event is higher away from the equator. The constant risk hypothesis undergirding the homogeneous Poisson point pattern model requires a generalization to include a spatially varying density function.

To do this you define the density \( \lambda(s) \), where \( s \) denotes spatial points. This is called a inhomogeneous Poisson model and it is analogous to the universal kriging model used on field data. Inhomogeneity as defined by a spatially varying density implies non-stationarity as the number of events depends on location.

Spatial density

Density is a first-order property of a spatial point pattern. That is; the density function estimates the mean number of events at any point in your domain. Events are independent of one another, but event clusters appear because of the varying density.

Given an observed set of events, how do you estimate \( \lambda(s) \)? One approach is to use kernel densities. Consider the set of hurricanes over the period 1944–2000 that were designated tropical only and baroclinically enhanced. Input these data and create a spatial points data frame of the genesis locations by typing

bh = read.csv("bi.csv", header = TRUE)
require(sp)
coordinates(bh) = c("FirstLon", "FirstLat")
ll = "+proj=longlat +ellps=WGS84"
proj4string(bh) = CRS(ll)

Next convert the geographic coordinates using the Lambert conformal conic projection true at parallels 10 and 40 \( ^\circ \) N and a center longitude of 60 \( ^\circ \) W. First save the reference system as a CRS object then use the spTransform() function from the rgdal package.

lcc = "+proj=lcc +lat_1=40 +lat_2=10 +lon_0=-60"
require(rgdal)
bht = spTransform(bh, CRS(lcc))

The spatial distance unit is meters. Use the map() (maps) and the map2SpatialLines() (maptools) to obtain country borders by typing

require(maps)
require(maptools)
brd = map("world", xlim = c(-100, -30), ylim = c(10, 48), plot = FALSE)
brd_ll = map2SpatialLines(brd, proj4string = CRS(ll))
brd_lcc = spTransform(brd_ll, CRS(lcc))

You use the same coordinate transformation on the map borders as you do on the cyclone locations.

Next you need to convert your S4 class objects (bh and clp) into S3 class objects for use with the functions in the spatstat package.

require(spatstat)
bhp = as.ppp(bht)
clpp = as.psp(brd_lcc)

The spatial point pattern object bhp contains marks. A mark is attribute information at each event location. The marks are the columns from the original data that were not used for location.

You are interested only in the hurricane type (either tropical only or baroclinic) so you reset the marks accordingly by typing

marks(bhp) = bht$Type == 0

You summarize the object with the summary() method. The summary includes an average density over the spatial domain. The density is per unit area.

Your native length unit is meter from the Lambert planar projection so your density is per square meter. You retrieve the average density in units of per (1000 km)\( ^2 \) by typing

summary(bhp)$intensity * 1e+12

## [1] 11.42

Thus, on average, each point in your spatial domain has slightly more than ten hurricane origins per one thousand squared kilometers. This is the mean spatial density. Some areas have more or less than the mean so you would like an estimate of \( \lambda(s) \).

You do this using the density() method, which computes a kernel smoothed density function from your point pattern object. By default the kernel is gaussian with a bandwidth equal to the standard deviation of the isotropic kernel.

The default output is a pixel image containing local density estimates. Here again you convert the density values to have units of 1000 squared kilometers.

den = density(bhp)
den$v = den$v * 1e+12

You use the plot() method first to plot the density image, then the country border, and finally the event locations.

plot(den)
plot(unmark(clpp), add = TRUE)
plot(unmark(bhp), pch = 19, cex = 0.3, add = TRUE)

plot of chunk densityMap

Event density maps are split by tropical only and baroclinic hurricane genesis.

Here we convert the im object to a SpatialGridDataFrame object and used the spplot() method.

to = bhp[bhp$marks == TRUE]
be = bhp[bhp$marks == FALSE]
den1 = density(to)
den1$v = den1$v * 1e+12
den.sgdf = as(den1, "SpatialGridDataFrame")
den2 = density(be)
den2$v = den2$v * 1e+12
den.sgdf$v2 = as(den2, "SpatialGridDataFrame")$v
l3 = list("sp.lines", brd_lcc, lwd = 0.3, first = FALSE)
p1 = spplot(den.sgdf, "v", col.regions = rev(topo.colors(20)), sp.layout = l3, 
    at = seq(0, 30, 5), colorkey = list(space = "bottom", labels = list(cex = 0.7)), 
    sub = list("Density [Events/(1000 km)$^2$]", cex = 0.7, font = 1))
p2 = spplot(den.sgdf, "v2", col.regions = rev(topo.colors(20)), sp.layout = l3, 
    at = seq(0, 30, 5), colorkey = list(space = "bottom", labels = list(cex = 0.7)), 
    sub = list("Density [Events/(1000 km)$^2$]", cex = 0.7, font = 1))
require(grid)
p1 = update(p1, main = textGrob("a", x = unit(0.05, "npc")))
p2 = update(p2, main = textGrob("b", x = unit(0.05, "npc")))
plot(p1, split = c(1, 1, 2, 1), more = TRUE)
plot(p2, split = c(2, 1, 2, 1), more = FALSE)

plot of chunk densityMaps2

Regions of the central Atlantic extending westward through the Caribbean into the southern Gulf of Mexico have the greatest tropical-only density generally exceeding ten hurricane origins per (1000~km)\( ^2 \).

In contrast, regions off the eastern coast of the United States extending eastward to Bermuda have the greatest baroclinic density. The amount of smoothing is determined to a large extent by the bandwidth and to a much smaller extent by the type of kernel.

Densities at the genesis locations are made using the argument at=“points” in the density() function. Also, the number of events in grid boxes across the spatial domain can be obtained using the quadratcount() or pixellate() functions. For example, type

plot(quadratcount(bhp))

plot of chunk quadratPixellate

plot(pixellate(bhp, dimyx = 5))

plot of chunk quadratPixellate

Second-order properties

The density function above describes the rate (mean) of hurricane genesis events locally. Second-order functions describe the variability of events. Ripley's K function is an example. It is defined as \[ \hat{K}(s) = \lambda^{-1}n^{-1}\sum_{i\ne j} I(d_{ij}

The definition assumes stationarity and isotropy for the point pattern and implies that under CSR if the events are homogeneous \( \hat{K}(s) \) should be approximately equal to \( \pi s^2 \). For point patterns more regular than CSR, you expect fewer events within distance \( s \) of a randomly chosen event than under CSR, so \( \hat{K}(s) < \pi s^2 \). For point patterns more clustered than CSR, you expect more events within a given distance than under CSR, or \( \hat{K}(s) > \pi s^2 \).

The function Kest() is used to compute \( \hat{K}(s) \). Here you save the results by typing

k = Kest(bhp)

The function takes an object of class ppp and computes \( \hat{K}(s) \) using different formulas that correct for edge effects to reduce bias arising from counting hurricanes near the borders of your spatial domain (Ripley 1991, Baddeley et al. 2000).

The K function is defined such that \( \lambda \hat{K}(s) \) equals the expected number of additional hurricanes within a distance \( s \) of any other hurricane. You plot the expected number as a function of separation distance by typing

lam = summary(bhp)$intensity
m2km = 0.001
plot(k$r * m2km, k$iso * lam, type = "l", lwd = 2, xlab = "Lag Distance (s) [km]", 
    ylab = "Avg Number of Nearby Hurricanes")

You first save the value of \( \lambda \) and a conversion factor to go from meters to kilometers. The iso column in results object refers to the K function computed using the isotropic correction for regular polygon domains.

The Kest() function also returns theoretical values under the hypothesis of CSR. You add these to your plot by typing

lines(k$r * m2km, k$theo * lam, col = "red", lwd = 2)

The empirical curve lies above the theoretical curve at all lag distances. For instance at a lag distance of


Error in unique(c("AsIs", oldClass(x))) : object 'm2km' not found

km, there are on average about


Error in unique(c("AsIs", oldClass(x))) : object 'lam' not found

additional hurricanes nearby. This compares with an expected number of


Error in unique(c("AsIs", oldClass(x))) : object 'lam' not found

additional hurricanes.

This indicates that for a given random hurricane, there are more nearby hurricanes than you would expect by chance under the assumption of CSR. This is indicative of clustering.

But as you show above, the clustering is related to the inhomogeneous distribution of hurricanes events across the basin. So this result is not particularly useful.

Instead you use the Kinhom() function to compute a generalization of the K function for inhomogeneous point patterns (Baddeley et al. 2000).

m2km = 0.001
den = density(bhp)
k = Kest(bhp)
k2 = Kinhom(bhp, lambda = den)
# e = envelope(bhp, Kinhom, nsim=99, verbose=FALSE)
lam = summary(bhp)$intensity
par(mfrow = c(1, 2), las = 1, mgp = c(2, 0.4, 0), tcl = -0.3)
plot(k$r * m2km, k$iso * lam, type = "l", lwd = 2, ylim = c(0, 55), xlab = "Lag distance (s) [km]", 
    ylab = "Avg number of nearby hurricanes")
grid()
lines(k$r * m2km, k$iso * lam)
lines(k$r * m2km, k$theo * lam, col = "red", lwd = 2)
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
plot(k2$r * m2km, k2$iso * lam, type = "l", lwd = 2, ylim = c(0, 55), xlab = "Lag distance (s) [km]", 
    ylab = "Avg number of nearby hurricanes")
grid()
# xx = c(e$r * m2km, rev(e$r * m2km)) yy = c(e$lo * lam, rev(e$hi * lam))
# polygon(xx, yy, border=NA, col='gray')
lines(k2$r * m2km, k2$iso * lam, lwd = 2)
lines(k2$r * m2km, k2$theo * lam, col = "red", lwd = 2)
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

plot of chunk KfunctionPlot2

Black curves are the empirical estimates and red curves are the theoretical. The empirical curve is much closer to the inhomogeneous theoretical curve.

There appears to be some clustering of hurricane genesis at short distances and regularity at larger distances. The regularity at large distance is likely due to land. The analysis can be improved by masking land areas rather than using a rectangular window.

Models

You can a fit parametric model to your point pattern. This allows you to predict the occurrence of hurricane events given covariates and the historical spatial patterns including clustering and regularity. The scope of models is quite wide, but the spatstat package contains quite a few options using the ppm() function.

Models can include spatial trend, dependence on covariates, and event interactions of any order (you are not restricted to pairwise interactions). Models are fit using the method of maximum pseudo-likelihood, or using an approximate maximum likelihood method. Model syntax is standard R. Typically the null model is homogeneous Poisson. Tornado touchdown points are modeled in this way in Elsner et al. (2013).

Feature Clusters

In the first two sections you examined clustering in time and geographic space. It is also possible to cluster in feature space. Indeed, this is the best known of the cluster methods. Feature clustering is called cluster analysis.

Cluster analysis searches for groups in feature (attribute) data. Objects belonging to the same cluster have similar features. Objects belonging to different clusters have dissimilar features. In two or three dimensions, clusters can be visualized. In more than three dimensions analytic assistance is helpful.

Note the difference with the two previous cluster methods. With those methods your initial goal was cluster detection with the ultimate goal to use the clusters to improve prediction.

Here your goal is descriptive and cluster analysis is a data-reduction technique. You start with the assumption that your data can be grouped based on feature similarity.

Cluster analysis methods fall into two distinct categories: partition and hierarchical. Partition clustering divides your data into a pre-specified number of groups. To help you decide on the number of clusters you often have to try several different numbers. An index of cluster quality might provide an easy call as to the 'best' number.

Hierarchical clustering creates increasing or decreasing sets of clustered groups from your data. Agglomerative hierarchical starts with each object forming its own individual cluster. It then successively merges clusters until a single large cluster remains (your entire data). Divisive hierarchical is just the opposite. It starts with a single super cluster that includes all objects and proceeds to successively split the clusters into smaller groups.

To cluster in feature space you need to: Create a dissimilarity matrix from a set of objects, and group the objects based on the values of the dissimilarity matrix.

Dissimilarity and distance

The dissimilarity between two objects measures how different they are. The larger the dissimilarity the greater the difference. Objects are considered vectors in attribute (feature) space, where vector length equals the number of data set attributes.

Consider for instance a data set with three attributes and four objects. & Feature 1 & Feature 2 & Feature 3
Object 1 & \( x_{1,1} \) & \( x_{1,2} \) & \( x_{1,3} \)
Object 2 & \( x_{2,1} \) & \( x_{2,2} \) & \( x_{2,3} \)
Object 3 & \( x_{3,1} \) & \( x_{3,2} \) & \( x_{3,3} \)
Object 4 & \( x_{4,1} \) & \( x_{4,2} \) & \( x_{4,3} \)

Object 1 is a vector consisting of the triplet (\( x_{1,1} \), \( x_{1,2} \), \( x_{1,3} \)), object 2 is a vector consisting of the triplet (\( x_{2,1} \), \( x_{2,2} \), \( x_{2,3} \)), and so on. Then the elements of a dissimilarity matrix are pairwise distances between the objects in feature space.

As an example, an object might be a hurricane track with features that include genesis location, lifetime maximum intensity, and lifetime maximum intensification.

Although distance is an actual metric, the dissimilarity function need not be. The minimum requirements for a dissimilarity measure \( d \) are: \( d_{i,i} = 0 \)
\( d_{i,j} \ge 0 \)
\( d_{i,j} = d_{j,i} \)

The following axioms of a proper metric do not need to be satisfied: \( d_{i,k} \leq d_{i,j} + d_{j,k} \) triangle inequality
\item \( d_{i,j}=0 \) implies \( i=j \)

Before clustering you need to arrange your data set as a \( n \times p \) data matrix, where \( n \) is the number of rows (one for each object) and \( p \) is the number of columns (one for each attribute variable).

How you compute dissimilarity depends on your attribute variables. If the variables are numeric you use Euclidean or Manhattan distance given as \[ d^E_{i,j}=\sqrt{\sum_{f=1}^p (x_{i,f}-x_{j,f})^2} \\ d^M_{i,j}=\sum_{f=1}^p |x_{i,f}-x_{j,f}| \]

The summation is over the number of attributes (features). With hierarchical clustering the maximum distance norm, given by \[ d^{\hbox{max}}_{i,j} = \hbox{max}(|x_{i,f}-x_{j,f}|, f=1, \ldots, p) \] is sometimes used.

Measurement units on your feature variables influence the relative distance values which, in turn, will affect your clusters. Features with high variance will have the largest impact.

If all features are deemed equally important to your grouping, then the data need to be standardized. You can do this with the scale() function, whose default method centers and scales the columns of your data matrix.

The dist() function computes the distance between objects using a variety of methods including Euclidean (default), Manhattan, and maximum. Create two feature vectors each having five values and plot them with object labels in feature space.

x1 = c(2, 1, -3, -2, -3)
x2 = c(1, 2, -1, -2, -2)
plot(x1, x2, xlab = "Feature 1", ylab = "Feature 2")
text(x1, x2, labels = 1:5, pos = c(1, rep(4, 4)))

plot of chunk featureVectors

From the plotted points you can easily group the two features into two clusters by eye. There is an obvious distance separation between the clusters. To actually compute pairwise Euclidean distances between the five objects, type

d = dist(cbind(x1, x2))
d

##       1     2     3     4
## 2 1.414                  
## 3 5.385 5.000            
## 4 5.000 5.000 1.414      
## 5 5.831 5.657 1.000 1.000

The result is a vector of distances, but printed as a table with the object numbers listed as row and column names. The values are the pairwise distances so for instance the distance between object 1 and object 2 is 1.41 units.

You can see two distinct clusters of distances (dissimilarities) those less or equal to 1.41 and those greater or equal to 5. Objects 1 and 5 are the most dissimilar followed by objects 2 and 5. On the other hand, objects 3 and 5 and 4 and 5 are the most similar. You can change the default Euclidean distance to the Manhattan distance by including method=“man” in the dist() function.

If the feature vectors contain values that are not continuous numeric (e.g., factors, ordered, binary) or if there is a mixture of data types (one feature is numeric the other is a factor, for instance), then dissimilarities need to be computed differently.

The cluster package contains functions for cluster analysis including daisy() for dissimilarity matrix calculations, which by default uses Euclidean distance as the dissimilarity metric.

To test, type

require(cluster)
d = daisy(cbind(x1, x2))
d

## Dissimilarities :
##       1     2     3     4
## 2 1.414                  
## 3 5.385 5.000            
## 4 5.000 5.000 1.414      
## 5 5.831 5.657 1.000 1.000
## 
## Metric :  euclidean 
## Number of objects : 5

Note the values that make up the dissimilarity matrix are the same as the distance values above, but additional information is saved including the metric used and the total number of objects in your data set.

The function contains a logical flag called stand that when set to true standardizes the feature vectors before calculating dissimilarities. If some of the features are not numeric, the function computes a generalized coefficient of dissimilarity (Gower 1971).

K-means clustering

Partition clustering requires you to specify the number of clusters beforehand. This number is denoted \( k \). The algorithm allocates each object in your data frame to one and only one of the \( k \) clusters.

The \( k \)-means method is the most popular. Membership of an object is determined by its distance from the centroid of each cluster. The centroid is the multidimensional version of the mean. The method alternates between calculating the centroids based on the current cluster members, and reassigning objects to clusters based on the new centroids.

For example, in deciding which of the two clusters an object belongs, the method computes the dissimilarity between the object and the centroid of cluster one and between the object and the centroid of cluster two. It then assigns the object to the cluster with the smallest dissimilarity and recomputes the centroid with this new object included.

An object already in this cluster might now have greater dissimilarity due to the reposition of the centroid, in which case it gets reassigned. Assignments and reassignments continue in this way until all objects have a membership.

The kmeans() function from the base stat package performs \( k \)-means clustering. By default it uses the algorithm of Hartigan and Wong (1979). The first argument is the data frame (not the dissimilarity matrix) and the second is the number of clusters (centers). To perform a \( k \)-means cluster analysis on the example data above, type

dat = cbind(x1, x2)
ca = kmeans(dat, centers = 2)
ca

## K-means clustering with 2 clusters of sizes 2, 3
## 
## Cluster means:
##       x1     x2
## 1  1.500  1.500
## 2 -2.667 -1.667
## 
## Clustering vector:
## [1] 1 1 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 1.000 1.333
##  (between_SS / total_SS =  93.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"

Initial centroids are chosen at random so it is recommended to rerun the algorithm a few times to see if it arrives at the same groupings. While the algorithm minimizes within-cluster variance, it does not ensure that there is a global minimum variance across all clusters.

The first bit of output gives the number of clusters (input) and the size of the clusters that results. Here cluster 1 has two members and cluster 2 has three. The next bit of output are the cluster centroids.

There are two features labeled x1 and x2. The rows list the centroids for clusters 1 and 2 as vectors in this two-dimensional feature space. The centroid of the first cluster is (-2.67, -1.67) and the centroid of the second cluster is (1.5, 1.5).

The centroid is the feature average using all objects in the cluster. You can see this by adding the centroids to your plot.

plot(x1, x2, xlab = "Feature 1", ylab = "Feature 2")
text(x1, x2, labels = 1:5, pos = c(1, rep(4, 4)))
points(ca$centers, pch = 8, cex = 2)
text(ca$centers, pos = c(4, 1), labels = c("Cluster 2", "Cluster 1"))

plot of chunk addClusterCentroids

The next bit of output tags each object with cluster membership. Here you see the first two objects belong to cluster 2 and the next three belong to cluster 1. The cluster number keeps track of distinct clusters but the numerical order is irrelevant.

The within-cluster sum of squares is the sum of the distances between each object and the cluster centroid for which it is a member. From the plot you can see that the centroids (stars) minimize the within cluster distances while maximizing the between cluster distances.

The function pam() is similar but uses medoids rather than centroids. A medoid is an multidimensional center based on medians. The method accepts a dissimilarity matrix, tends to be more robust (converges to the same result), and provides for additional graphical displays from the cluster package.

Track clusters

Your interest is to group hurricanes according to common track features. These features include location of hurricane origin, location of lifetime maximum intensity, location of final hurricane intensity, and lifetime maximum intensity.

The three location features are further subdivided into latitude and longitude making a total of seven attribute (feature) variables. Note that location is treated here as an attribute rather than as a spatial coordinate.

You first create a data frame containing only the attributes you wish to cluster. Here you use the cyclones since 1950.

load("best.use.RData")
best = subset(best.use, Yr >= 1950)

You split the data frame into separate cyclone tracks using Sid with each element in the list as a separate track. You are interested in identifying features associated with each track.

best.split = split(best, best$Sid)

You then assemble a data frame using only the attributes to be clustered.

best.c = t(sapply(best.split, function(x) {
    x1 = unlist(x[1, c("lon", "lat")])
    x2 = unlist(x[nrow(x), c("lon", "lat")])
    x3 = max(x$WmaxS)
    x4 = unlist(x[rev(which(x3 == x$WmaxS))[1], c("lon", "lat")])
    return(c(x1, x2, x3, x4))
}))
best.c = data.frame(best.c)
colnames(best.c) = c("FirstLon", "FirstLat", "LastLon", "LastLat", "WmaxS", 
    "MaxLon", "MaxLat")

The data frame contains 7 features from 667 cyclones.

You check the feature variances by applying the var() function on the columns of the data frame using the sapply() function.

sapply(best.c, var)

## FirstLon FirstLat  LastLon  LastLat    WmaxS   MaxLon   MaxLat 
##   499.21    57.04   710.41   164.82   870.69   356.72    75.90

Variances range from a minimum of 57 degrees squared for the latitude of origin feature to a maximum of 870.7 knots squared for the maximum intensity feature.

Because of the large range in variances you scale the features before performing cluster analysis. This is needed if your intent is to have each feature have the same influence on the clustering.

best.cs = scale(best.c)
m = attr(best.cs, "scaled:center")
s = attr(best.cs, "scaled:scale")

By default the function scale() centers and scales the columns of your numeric data frame. The center and scale values are saved as attributes in the new data frame. Here you save them to rescale the centroids after the cluster analysis.

You perform a \( k \)-means cluster analysis setting the number of clusters to three by typing

k = 3
ct = kmeans(best.cs, centers = k)
summary(ct)

##              Length Class  Mode   
## cluster      667    -none- numeric
## centers       21    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric

The output is a list of length seven containing the cluster vector, cluster means (centroids), the total sum of squares, the within cluster sum of squares by cluster, the total within sum of squares, the between sum of squares, and the size of each cluster.

Your cluster means are scaled so they are not readily meaningful. The cluster vector gives the membership of each hurricane in order as they appear in your data set. The ratio of the between sum of squares to the total sum of squares is 44.7%.

This ratio will increase with the number of clusters, but at the expense of having clusters that are not physically interpretable. With four clusters, the increase in this ratio is smaller than the increase going from two to three clusters. So you are content with your three-cluster solution.

Track plots

Since six of the seven features are spatial coordinates it is tempting to plot the centroids on a map and connect them with a track line. This would be a mistake as there is not a geographic representation of your feature clusters. Instead, you plot examples of cyclones that resemble each cluster.

First, you add cluster membership and distance to your original data frame then split the data by cluster member.

ctrs = ct$center[ct$cluster, ]
cln = ct$cluster
dist = rowMeans((best.cs - ctrs)^2)
id = 1:length(dist)
best.c = data.frame(best.c, id = id, dist = dist, cln = cln)
best.c.split = split(best.c, best.c$cln)

Next you subset your cluster data based on the tracks that come closest to the cluster centroids. This closeness is in feature space that includes latitude and longitude but also intensity. Here you choose 9 tracks for each centroid.

te = 9
bestid = unlist(lapply(best.c.split, function(x) x$id[order(x$dist)[1:te]]))
cinfo = subset(best.c, id %in% bestid)

Finally you plot the tracks on a map. This requires a few steps to make the plot easier to read. Begin by setting the bounding box using latitude and longitude of your cluster data and renaming your objects.

cyclones = best.split
uselat = range(unlist(lapply(cyclones[cinfo$id], function(x) x$lat)))
uselon = range(unlist(lapply(cyclones[cinfo$id], function(x) x$lon)))

Next, order the clusters by number and distance and set the colors for plotting. Use the brewer.pal() function in the RColorBrewer package (Neuwirth 2011). You use a single hue sequential color ramp that allows you to highlight the cyclone tracks that are closest to the feature clusters.

cinfo = cinfo[order(cinfo$cln, cinfo$dist), ]
require(RColorBrewer)
blues = brewer.pal(te, "Blues")
greens = brewer.pal(te, "Greens")
reds = brewer.pal(te, "Reds")
cinfo$colors = c(rev(blues), rev(greens), rev(reds))

Next, reorder the tracks for plotting them as a weave with the tracks farthest from the centroids plotted first.

cinfo = cinfo[order(-cinfo$dist, cinfo$cln), ]

Finally, plot the tracks and add world and country borders.

require(maps)
plot(uselon, uselat, type = "n", xaxt = "n", yaxt = "n", ylab = "", xlab = "")
map("world", add = TRUE)
map("usa", add = TRUE)
for (i in 1:nrow(cinfo)) {
    cid = cinfo$id[i]
    cyclone = cyclones[[cid]]
    n = length(cyclone$lon)
    lines(cyclone$lon, cyclone$lat, lwd = 2, col = cinfo$colors[i])
    arrows(cyclone$lon[n - 1], cyclone$lat[n - 1], cyclone$lon[n], cyclone$lat[n], 
        lwd = 2, length = 0.1, col = cinfo$colors[i])
}

plot of chunk plotTracksByClusterType

Track color is based on attribute proximity to the cluster centroid using a color saturation that decreases with distance.

The analysis splits the cyclones into a group over the Gulf of Mexico, a group at high latitudes, and a group that begins at low latitudes but ends at high latitudes generally east of the United States. Some of the cyclones that begin at high latitude are baroclinic.

Cluster membership can change depending on the initial random centroids particularly for tracks that are farthest from a centroid. The two- and three-cluster tracks are the most stable. The approach can be extended to include other track features including velocity, acceleration, and curvature representing more of the space and time behavior of hurricanes.

Model-based clustering, where the observations are assumed to be quantified by a finite mixture of probability distributions, is an attractive alternative if you want to make inferences about future hurricane activity.

In the end it's worthwhile to keep in mind the advice of Bill Venables and Brian Ripley. You should not assume cluster analysis is the best way to discover interesting groups. Indeed, visualization methods are often more effective in this task.