Chapter 11: Cluster Models

“There are in fact two things, science and opinion; the former begets knowledge, the latter, ignorance.”—Hippocrates

• required packages: maps, spatstat, sp, rgdal, maptools, cluster, xtable, RColorBrewer
• source code: source("correlationfuns.R")

Begin by loading annual.RData. These data were assembled in Chapter 5 of Elsner and Jagger (2013). Subset the data for years starting with 1866.

load("annual.RData")
dat = subset(annual, Year >= 1866)


The covariate Southern Oscillation Index (SOI) data begin in 1866. Next, extract all hurricane counts for the Gulf coast, Florida, and East coast regions.

cts = dat[, c("G.1", "F.1", "E.1")]


Cluster detection

Start by comparing the observed number of years with the expected number of years for groups of annual hurricane counts. The groups of interest are years with no hurricanes and years with three or more. The expected numbers come from a Poisson distribution with a constant rate.

The idea is that for regions that show a cluster of hurricanes, the observed number of years with no hurricanes and the number of years with three or more hurricanes should both be greater than the corresponding expected numbers. Said another way, a Poisson model with a hurricane rate estimated from counts over all years will, in regions with hurricane clustering, underestimate the number of years with no hurricanes and the number of years with many hurricanes.

For example, you find the observed number of years without a Florida hurricane and the number of years with more than two hurricanes by typing

obs = table(cut(cts$F.1, breaks = c(-0.5, 0.5, 2.5, Inf)))
obs

##
## (-0.5,0.5]  (0.5,2.5]  (2.5,Inf]
##         70         62         13


And the expected numbers for these three groups by typing

n = length(cts$F.1)
mu = mean(cts$F.1)
exp = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda = mu))
exp

## [1] 61.655 75.271  8.074


You use functions in the correlationfuns.R file to get the observed and expected counts for the three regions as a table.

source("correlationfuns.R")
mu = colMeans(cts)
rt = regionTable(cts, mu)


In the Gulf and East coast regions the observed number of years is relatively close to the expected number in each group. In contrast, in the Florida region the observed number of years exceeds the expected number for both the no-hurricane and the three-or-more-hurricane groups. The difference between the observed and expected numbers in each region is used to assess the statistical significance of the clustering. This is done using Pearson residuals and the chi-squared statistic. The Pearson residual is the difference between the observed and expected counts divided by the square root of the variance. The p-value is evidence in support of the null hypothesis of no clustering, as indicated by no difference between the observed and expected numbers in each group. For example, to obtain the chi-squared statistic, type

xis = sum((obs - exp)^2/exp)
xis

## [1] 6.475


The p-value as evidence in support of the null hypothesis is given by

pchisq(q = xis, df = 2, lower.tail = FALSE)

## [1] 0.03926


where df is the degrees of freedom equal to the number of groups minus one. The p-values for the Gulf and East coasts are greater than .05, indicating little support for the cluster hypothesis. In contrast, the p-value for the Florida region is 0.009 using the Pearson residuals and 0.047 using the chi-squared statistic. These values provide evidence that hurricane occurrences in Florida are grouped in time.

Conditional counts

What might be causing this grouping? The extra variation in annual hurricane counts might be due to variation in hurricane rates. You examine this possibility with a Poisson regression model.
The model includes an index for the North Atlantic Oscillation (NAO) and the Southern Oscillation (SOI) after Elsner and Jagger (2006). This is a generalized linear model (GLM) approach using the Poisson family, which includes a logarithmic link function for the rate. The code to fit the model, determine the expected count, and table the observed versus expected counts for each region is given by

pfts = list(G.1 = glm(G.1 ~ nao + soi, family = "poisson", data = dat),
    F.1 = glm(F.1 ~ nao + soi, family = "poisson", data = dat),
    E.1 = glm(E.1 ~ nao + soi, family = "poisson", data = dat))
prsp = sapply(pfts, fitted, type = "response")
rt = regionTable(cts, prsp, df = 3)
rt

##     Expected <= 0 Observed <= 0 1 <= Expected <= 2 1 <= Observed <= 2
## G.1         67.68            63              69.96                 75
## F.1         63.66            70              72.31                 62
## E.1         72.43            74              67.60                 64
##     Expected >= 3 Observed >= 3 Pearson.test Pearson.pvalue
## G.1         7.359             7        121.1        0.89736
## F.1         9.031            13        172.4        0.04214
## E.1         4.970             7        148.5        0.33771
##     statistic.X-squared p.value
## G.1              0.7043  0.7011
## F.1              3.8479  0.1419
## E.1              1.0554  0.5872


The count model gives an expected number of hurricanes each year. The expected number is compared with the observed number as before. Results indicate that the clustering is somewhat ameliorated by conditioning the rates on the covariates. In particular, the Pearson residual for Florida reduces to 172.4 with an increase in the corresponding p-value to 0.042. However, the chi-squared p-value remains near .15, indicating the conditional model, while an improvement, fails to capture all the extra variation in Florida hurricane counts.

A cluster model

Having found evidence that Florida hurricanes arrive in clusters, you model this process. In the simplest case you assume the following. Each cluster has either one or two hurricanes, and the annual cluster count follows a Poisson distribution with rate $$r$$. Note the difference. Earlier you assumed each hurricane was independent and annual hurricane counts followed a Poisson distribution.
Further, you let $$p$$ be the probability that a cluster will have two hurricanes. Formally your model can be expressed as follows. Let $$N$$ be the number of clusters in a given year and $$X_i, i = 1, \dots, N$$ be the number of hurricanes in each cluster minus one. Then the number of hurricanes in a given year is $$H = N + \sum_{i=1}^N X_i$$. Conditional on $$N$$, $$M = \sum_{i=1}^N X_i$$ has a binomial distribution since the $$X_i$$'s are independent Bernoulli variables and $$p$$ is constant. That is, $$H = N + M$$, where the annual number of clusters $$N$$ has a Poisson distribution with cluster rate $$r$$, and $$M$$ has a binomial distribution with proportion $$p$$ and size $$N$$. Here the binomial distribution describes the number of two-hurricane clusters among the $$N$$ independent clusters, each cluster having probability $$p$$ of containing a second hurricane. The model has two parameters, $$r$$ and $$p$$. A better parameterization uses $$\lambda = r(1+p)$$ together with $$p$$ to separate the hurricane frequency from the cluster probability. The parameters do not need to be fixed and can be functions of the covariates. When $$p = 0$$, $$H$$ is Poisson, and when $$p = 1$$, $$H/2$$ is Poisson, the dispersion is two, and the probability that $$H$$ is even is one. You need a way to estimate $$r$$ and $$p$$.

Parameter estimation

Your goal is a hurricane count distribution for Florida. For that you need an estimate of the annual cluster rate ($$r$$) and the probability ($$p$$) that the cluster size is two. Continuing with the GLM approach you separately estimate the annual hurricane frequency, $$\lambda$$, and the annual cluster rate, $$r$$. The ratio of these two parameters minus one is an estimate of the probability $$p$$. This is reasonable if $$p$$ does not vary much, since the annual hurricane count variance is proportional to the expected hurricane count [i.e., $$\mbox{var}(H) = r(1+3p) \propto r \propto E(H)$$].
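The moment identities above can be checked with a short simulation. This is a sketch with illustrative values, not estimates from the hurricane data: the cluster rate r = 0.5 and cluster probability p = 0.2 are chosen arbitrarily.

```r
# Simulate the cluster model H = N + M and check the stated moments
# E(H) = r(1 + p) and var(H) = r(1 + 3p).
# r and p are illustrative values, not estimates from the data.
set.seed(1)
r = 0.5
p = 0.2
n = 1e+06                           # number of simulated years
N = rpois(n, lambda = r)            # annual number of clusters
M = rbinom(n, size = N, prob = p)   # clusters containing a second hurricane
H = N + M                           # annual hurricane count
c(mean(H), r * (1 + p))             # both near 0.6
c(var(H), r * (1 + 3 * p))          # both near 0.8
```

The simulated mean and variance match the algebraic forms, confirming that the extra variation (relative to a Poisson with the same mean) grows with $$p$$.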
You estimated the parameters of the annual count model using Poisson regression, which assumes that the variance of the count is, in fact, proportional to the expected count. Thus, under the assumption that $$p$$ is constant, Poisson regression can be used for estimating $$\lambda$$ in the cluster model. As before, you regress the logarithm of the cluster rate onto the predictors NAO and SOI. The parameters of this annual cluster count model cannot be estimated directly, since the observed hurricane count does not furnish information about the number of clusters. Consider the observed set of annual Florida hurricane counts. Since the annual frequency is quite small, the majority of years have either no hurricanes or a single hurricane. You can create a reduced data set by using an indicator of whether or not there was at least one hurricane. Formally, let $$I_i = I(H_i > 0) = I(N_i > 0)$$; then $$I$$ is an indicator of the occurrence of a hurricane cluster for each year. You assume $$I$$ has a binomial distribution with a size parameter of one and a proportion equal to $$\pi$$. This leads to a logistic regression model (see the chapter on frequency models) for $$I$$. Note that since $$\exp(-r)$$ is the probability of no clusters, the probability of a cluster is $$\pi = 1 - \exp(-r)$$. Thus the cluster rate is $$r = -\log(1-\pi)$$. If you use a logarithmic link function on $$r$$, then $$\log(r) = \log(-\log(1-\pi)) = \mbox{cloglog}(\pi)$$, where cloglog is the complementary log-log function. Thus you model $$I$$ using the cloglog link to obtain $$r$$.

Your cluster model is a combination of two models, one for the counts and another for the clusters. You start by comparing fitted values from the count model with fitted values from the cluster model. Let $$H_i$$ be the hurricane count in year $$i$$ and $$\hat\lambda_i$$ and $$\hat r_i$$ be the fitted annual count and cluster rates, respectively.
Then let $$\tau_0$$ be a test statistic given by $$\tau_0 = \frac{1}{n}\sum_{i=1}^n (H_i - \hat{r}_i) = \frac{1}{n}\sum_{i=1}^n (\hat{\lambda}_i - \hat{r}_i)$$. The value of $$\tau_0$$ is greater than zero if there is clustering. You test the significance of $$\tau_0$$ by generating random samples of length $$n$$ from a Poisson distribution with rate $$\lambda_i$$ and computing $$\tau_j$$ for $$j = 1, \ldots, N$$, where $$N$$ is the number of samples. A test of the null hypothesis that $$\tau_0 \le 0$$ is the proportion of simulated $$\tau$$'s that are at least as large as $$\tau_0$$. You do this with the testfits() function in the correlationfuns.R file by specifying the model formula, data, and number of random samples.

tfF = testfits(F.1 ~ nao + soi, data = dat, N = 1000)
tfF$test

## [1] 0.1043

tfF$testpvalue

## [1] 0.036


For Florida hurricanes the test statistic $$\tau_0$$ has a value of 0.104, indicating some difference between the count and cluster rates. The proportion of 1000 simulated $$\tau$$'s at least as large as this is 0.036, providing sufficient evidence to reject the no-cluster hypothesis. Repeating the simulation using Gulf coast hurricanes

tfG = testfits(G.1 ~ nao + soi, data = dat, N = 1000)
tfG$test

## [1] -0.06174

tfG$testpvalue

## [1] 0.772


you find that, in contrast to Florida, there is little evidence against the no-cluster hypothesis.

A linear regression through the origin of the fitted count rate on the cluster rate under the assumption that $$p$$ is constant yields an estimate of $$1+p$$. You plot the annual count and cluster rates and draw the regression line using the plotfits() function.

par(mfrow = c(1, 2), pty = "s")
ptfF = plotfits(tfF)
mtext("a", side = 3, line = 1, adj = 0, cex = 1.1)
ptfG = plotfits(tfG)
mtext("b", side = 3, line = 1, adj = 0, cex = 1.1)

The black line is the $$y = x$$ line, and you expect the cluster and hurricane rates to align along this axis if there is no clustering. The red line is the regression of the fitted hurricane rate onto the fitted cluster rate with the intercept set to zero. The slope of the line is an estimate of $$1+p$$. The regression slopes are printed by typing

coefficients(ptfF)

##  rate
## 1.138

coefficients(ptfG)

##   rate
## 0.9425


The slope is 1.14 for the Florida region, giving 0.14 as an estimate of $$p$$ (the probability that a cluster will have two hurricanes). The regression slope is 0.94 for the Gulf coast region, which you interpret as a lack of evidence for hurricane clusters. Your focus is now on Florida hurricanes only. You continue by looking at the coefficients from both models. Type

summary(tfF$fits$poisson)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26820    0.10533  -2.546  0.01089
## nao         -0.22579    0.09023  -2.502  0.01234
## soi          0.05618    0.03026   1.856  0.06340

summary(tfF$fits$binomial)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.41950    0.13476 -3.1129 0.001852
## nao         -0.26861    0.12059 -2.2275 0.025917
## soi          0.02123    0.04059  0.5232 0.600840


The coefficient tables show that the NAO and SOI covariates are significant in the hurricane count model, but only the NAO is significant in the hurricane cluster model. The difference in coefficient values between the two models is an estimate of $$\log(1+p)$$, where again $$p$$ is the probability that a cluster will have two hurricanes. The difference in the NAO coefficient is 0.043 and the difference in the SOI coefficient is 0.035, indicating the NAO increases the probability of clustering more than ENSO does. Lower values of the NAO lead to a larger rate increase for the Poisson model relative to the binomial model.

Forecasts

It is interesting to compare forecasts of the distribution of Florida hurricanes using a Poisson model and your cluster model. Here you set $$p = 0.138$$ for the cluster model. You can use the same two-component formulation for your Poisson model by setting $$p = 0$$. You prepare your data using the lambdapclust() function as follows.

ctsF = cts[, "F.1", drop = FALSE]
pars = lambdapclust(prsp[, "F.1", drop = FALSE], p = 0.138)
ny = nrow(ctsF)
h = 0:5

Next you compute the expected number of years with h hurricanes from the cluster and Poisson models and tabulate the observed number of years. You combine them in a data object.

eCl = sapply(h, function(x) sum(do.call("dclust", c(x = list(rep(x, ny)), pars))))
ePo = sapply(0:5, function(x) sum(dpois(x = rep(x, ny), lambda = prsp[, "F.1"])))
o = as.numeric(table(ctsF))
dat = rbind(o, eCl, ePo)
names(dat) = 0:5

Finally you plot the observed versus the expected numbers from the cluster and Poisson models using a bar plot with the bars plotted side by side.
barplot(dat, ylab = "Number of Years", xlab = "Number of Florida Hurricanes",
    names.arg = c(0:5), col = c("black", "red", "blue"),
    legend = c("Observed", "Cluster", "Poisson"), beside = TRUE)

The expected numbers are based on a cluster model ($$p$$ = 0.138) and on a Poisson model ($$p$$ = 0). The cluster model fits the observed counts better than the Poisson model does, particularly in the low- and high-count years. Florida had hurricanes in only two of the 11 years from 2000 through 2010, but those two years featured seven hurricanes. Seasonal forecast models that predict U.S. hurricane activity assume a Poisson distribution. You show here that this assumption, applied to Florida hurricanes, leads to a forecast that underpredicts both the number of years without hurricanes and the number of years with three or more hurricanes (Jagger and Elsner 2012). The lack of fit in the forecast model arises from the clustering of hurricanes along this part of the coast. You demonstrate a temporal cluster model that assumes the rate of hurricane clusters follows a Poisson distribution with the size of a cluster limited to two hurricanes. The model fits the distribution of Florida hurricanes better than a Poisson model when both are conditioned on the NAO and SOI.

Spatial clusters

Is there a tendency for hurricanes to cluster in space? More specifically, given that a hurricane originates at a particular location, is it more (or less) likely that another hurricane will form in the same vicinity? This is a problem for spatial point pattern analysis. Here you consider models for analyzing and modeling events across space. An event is the occurrence of some phenomenon of interest. For example, an event can be a hurricane's origin or its lifetime maximum intensity. An event location is the spatial location of the event, for example, the latitude and longitude of the maximum intensity event. A point is any location where an event could occur.
A spatial point pattern is a collection of events and event locations together with a spatial domain. The spatial domain is the region of interest over which events tend to occur. For example, the North Atlantic basin is the spatial domain for hurricanes. To define spatial clustering, it helps to define its absence. A spatial point pattern is random if an event is equally likely to occur at any point within the spatial domain. A spatial point pattern is clustered if, given an event at some location, it is more likely than random that another event will occur nearby. Regularity is the opposite: given an event, if it is less likely than random that another event will occur nearby, then the spatial point pattern is regular. Said another way, complete spatial randomness (CSR) defines a situation where an event is equally likely to occur at any location within the study area regardless of the locations of other events. A realization is a collection of spatial point patterns generated under a spatial point process model. To illustrate, consider four point patterns, each consisting of events inside the unit plane. You generate the event locations and plot them by typing

par(mfrow = c(2, 2), mex = 0.9, pty = "s")
for (i in 1:4) {
    u = runif(n = 30, min = 0, max = 1)
    v = runif(n = 30, min = 0, max = 1)
    plot(u, v, pch = 19, xlim = c(0, 1), ylim = c(0, 1))
}

The patterns illustrate that some amount of clustering occurs by chance. The spatstat package (Baddeley and Turner 2005) contains a large number of functions for analyzing and modeling point pattern data. To make the functions available and obtain a citation, type

require(spatstat)
citation(package = "spatstat")

##
## To cite 'spatstat' in publications, please use:
##
## A. Baddeley and R. Turner (2005). Spatstat: an R package for
## analyzing spatial point patterns. Journal of Statistical
## Software 12 (6), 1-42. ISSN: 1548-7660.
## URL: www.jstatsoft.org
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
##   title = {Spatstat: an {R} package for analyzing spatial point patterns},
##   author = {Adrian Baddeley and Rolf Turner},
##   journal = {Journal of Statistical Software},
##   volume = {12},
##   number = {6},
##   pages = {1--42},
##   year = {2005},
##   note = {{ISSN} 1548-7660},
##   url = {www.jstatsoft.org},
## }
##
## For references to additional papers describing the use of
## 'spatstat' see 'help("spatstat")'.


Complete spatial randomness lies on a continuum between clustered and regular spatial point patterns. You use simulation functions to compare the CSR plots with regular and clustered patterns.

par(mfrow = c(2, 2), mex = 0.5)
plot(rMaternI(kappa = 80, r = 0.05), pch = 19, main = "")
plot(rMaternI(kappa = 80, r = 0.05), pch = 19, main = "")
plot(rMatClust(kappa = 20, r = 0.15, mu = 4), pch = 19, main = "")
plot(rMatClust(kappa = 20, r = 0.15, mu = 4), pch = 19, main = "")

Here you see two realizations of patterns more regular than CSR (top row) and two of patterns more clustered than CSR (bottom row). The simulations are made using a spatial point pattern model. Spatial scale plays a role: a set of events can indicate a regular spatial pattern on one scale but a clustered pattern on another.

Point processes

Given a set of spatial events (a spatial point pattern), your goal is to assess evidence for clustering (or regularity). Spatial cluster detection methods are based on statistical models for spatial point processes. The random variable in these models is the event location. Here we provide some definitions useful for understanding point process models. Details on the underlying theory are available in Ripley (1981), Cressie (1993), and Diggle (2003). A process generating events is stationary when it is invariant to translation across the spatial domain.
This means that the relationship between two events depends only on their positions relative to one another and not on the event locations themselves. This relative position refers to the (lag) distance and orientation between the events. A process is isotropic when orientation does not matter. Given a single realization, the assumptions of stationarity and isotropy allow for replication. Two pairs of events in the same realization of a stationary process that are separated by the same distance should have the same relatedness. This allows you to use relatedness information from one part of the domain as a replicate for relatedness from another part of the domain. The assumptions of stationarity and isotropy let you begin and can be relaxed later.

The Poisson distribution is used to define a model for CSR. A spatial point process is homogeneous Poisson under the following two criteria: (1) the number of events occurring within a finite region $$A$$ is a random variable following a Poisson distribution with mean $$\lambda |A|$$ for some positive constant $$\lambda$$, with $$|A|$$ denoting the area of $$A$$; and (2) given the total number of events $$N$$, their locations are an independent random sample of points, where each point is equally likely to be picked as an event location. The first criterion is about the density of the spatial process. For a given domain it answers the question: how many events? The density is the number of events divided by the domain area, and it is an estimate of the rate parameter of the Poisson distribution. Note that in spatial statistics this is often called the intensity. Here we use the term density instead, so as not to confuse the spatial rate parameter with hurricane strength. The second criterion is about homogeneity. Events are scattered throughout the domain and are neither clustered nor regular. It helps to consider how to create a homogeneous Poisson point pattern. The procedure follows directly from the definition.
Step 1: Generate the number of events using a Poisson distribution with mean $$\lambda$$. Step 2: Place the events inside the domain using a uniform distribution for both spatial coordinates. For example, let the area of the domain be one and the density of events be 200; then you type

lambda = 200
N = rpois(1, lambda)
x = runif(N)
y = runif(N)
plot(x, y, pch = 19)

Note that if your domain is not a rectangle, you can circumscribe it within a rectangle, then reject events sampled from inside the rectangle that fall outside the domain. By definition these point patterns are CSR. However, as noted above, you are typically in the opposite position: you observe a set of events and want to know whether they are regular or clustered. Your null hypothesis is CSR, and you need a test statistic to guide your inference. The above demonstrates that the null models are easy to construct, so you can use Monte Carlo methods.

In some cases the homogeneous Poisson model is too restrictive. Consider hurricane genesis as an event process. Event locations may be random, but the probability of an event is higher away from the equator. The constant-risk hypothesis undergirding the homogeneous Poisson point pattern model requires a generalization to include a spatially varying density function. To do this you define the density $$\lambda(s)$$, where $$s$$ denotes a spatial point. This is called an inhomogeneous Poisson model, and it is analogous to the universal kriging model used on field data. Inhomogeneity as defined by a spatially varying density implies non-stationarity, as the number of events depends on location.

Spatial density

Density is a first-order property of a spatial point pattern. That is, the density function estimates the mean number of events at any point in your domain. Events are independent of one another, but event clusters appear because of the varying density. Given an observed set of events, how do you estimate $$\lambda(s)$$?
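The rejection step for a non-rectangular domain can be sketched as follows. This is an illustration (not from the text), with the unit disk standing in for an arbitrary domain circumscribed by the square $$[-1, 1]^2$$.

```r
# CSR in the unit disk via rejection: sample candidate locations
# uniformly in the circumscribing square and keep those inside the disk.
set.seed(1)
lambda = 200                       # density per unit area
A = pi * 1^2                       # area of the unit disk
N = rpois(1, lambda * A)           # number of events in the disk
x = numeric(0)
y = numeric(0)
while (length(x) < N) {
    u = runif(N, -1, 1)            # candidates in the square [-1, 1]^2
    v = runif(N, -1, 1)
    keep = u^2 + v^2 <= 1          # reject points outside the disk
    x = c(x, u[keep])
    y = c(y, v[keep])
}
x = x[1:N]                         # retain exactly N accepted events
y = y[1:N]
plot(x, y, pch = 19, asp = 1)
```

Because the candidates are uniform over the square, the accepted points are uniform over the disk, so the pattern satisfies both Poisson criteria.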
One approach is to use kernel densities. Consider the set of hurricanes over the period 1944–2000 that were designated tropical only or baroclinically enhanced. Input these data and create a spatial points data frame of the genesis locations by typing

bh = read.csv("bi.csv", header = TRUE)
require(sp)
coordinates(bh) = c("FirstLon", "FirstLat")
ll = "+proj=longlat +ellps=WGS84"
proj4string(bh) = CRS(ll)

Next convert the geographic coordinates using the Lambert conformal conic projection true at parallels 10 and 40$$^\circ$$N with a center longitude of 60$$^\circ$$W. First save the reference system as a CRS object, then use the spTransform() function from the rgdal package.

lcc = "+proj=lcc +lat_1=40 +lat_2=10 +lon_0=-60"
require(rgdal)
bht = spTransform(bh, CRS(lcc))

The spatial distance unit is meters. Use map() (from the maps package) and map2SpatialLines() (from maptools) to obtain country borders by typing

require(maps)
require(maptools)
brd = map("world", xlim = c(-100, -30), ylim = c(10, 48), plot = FALSE)
brd_ll = map2SpatialLines(brd, proj4string = CRS(ll))
brd_lcc = spTransform(brd_ll, CRS(lcc))

You use the same coordinate transformation on the map borders as you do on the cyclone locations. Next you convert your S4 class objects into S3 class objects for use with the functions in the spatstat package.

require(spatstat)
bhp = as.ppp(bht)
clpp = as.psp(brd_lcc)

The spatial point pattern object bhp contains marks. A mark is attribute information attached to each event location. The marks here are the columns from the original data that were not used for location. You are interested only in the hurricane type (either tropical only or baroclinic), so you reset the marks accordingly by typing

marks(bhp) = bht$Type == 0


You summarize the object with the summary() method. The summary includes an average density over the spatial domain. The density is per unit area.

Your native length unit is the meter from the Lambert planar projection, so your density is per square meter. You retrieve the average density in units of per (1000 km)$$^2$$ by typing

summary(bhp)$intensity * 1e+12

## [1] 11.42


Thus, on average, each point in your spatial domain has slightly more than eleven hurricane origins per one thousand squared kilometers. This is the mean spatial density. Some areas have more or less than the mean, so you would like an estimate of $$\lambda(s)$$. You do this using the density() method, which computes a kernel-smoothed density function from your point pattern object. By default the kernel is Gaussian with a bandwidth equal to the standard deviation of the isotropic kernel. The default output is a pixel image containing local density estimates. Here again you convert the density values to units of per 1000 squared kilometers.

den = density(bhp)
den$v = den$v * 1e+12

You use the plot() method first to plot the density image, then the country borders, and finally the event locations.

plot(den)
plot(unmark(clpp), add = TRUE)
plot(unmark(bhp), pch = 19, cex = 0.3, add = TRUE)

Event density maps are split by tropical-only and baroclinic hurricane genesis. Here we convert the im object to a SpatialGridDataFrame object and use the spplot() method.

to = bhp[bhp$marks == TRUE]
text(ca$centers, pos = c(4, 1), labels = c("Cluster 2", "Cluster 1"))

The next bit of output tags each object with its cluster membership. Here you see the first two objects belong to cluster 2 and the next three belong to cluster 1. The cluster number keeps track of distinct clusters, but the numerical order is irrelevant. The within-cluster sum of squares is the sum of the distances between each object and the centroid of the cluster to which it belongs. From the plot you can see that the centroids (stars) minimize the within-cluster distances while maximizing the between-cluster distances. The function pam() is similar but uses medoids rather than centroids. A medoid is a multidimensional center based on medians. The method accepts a dissimilarity matrix, tends to be more robust (converges to the same result), and provides additional graphical displays from the cluster package.

Track clusters

Your interest is in grouping hurricanes according to common track features. These features include the location of hurricane origin, the location of lifetime maximum intensity, the location of final hurricane intensity, and the lifetime maximum intensity. The three location features are further subdivided into latitude and longitude, making a total of seven attribute (feature) variables. Note that location is treated here as an attribute rather than as a spatial coordinate. You first create a data frame containing only the attributes you wish to cluster. Here you use the cyclones since 1950.

load("best.use.RData")
best = subset(best.use, Yr >= 1950)

You split the data frame into separate cyclone tracks using Sid, with each element in the list a separate track. You are interested in identifying features associated with each track.

best.split = split(best, best$Sid)


You then assemble a data frame using only the attributes to be clustered.

best.c = t(sapply(best.split, function(x) {
    x1 = unlist(x[1, c("lon", "lat")])
    x2 = unlist(x[nrow(x), c("lon", "lat")])
    x3 = max(x$WmaxS)
    x4 = unlist(x[rev(which(x3 == x$WmaxS))[1], c("lon", "lat")])
    return(c(x1, x2, x3, x4))
}))
best.c = data.frame(best.c)
colnames(best.c) = c("FirstLon", "FirstLat", "LastLon", "LastLat", "WmaxS",
    "MaxLon", "MaxLat")


The data frame contains 7 features from 667 cyclones.

You check the feature variances by applying the var() function on the columns of the data frame using the sapply() function.

sapply(best.c, var)

## FirstLon FirstLat  LastLon  LastLat    WmaxS   MaxLon   MaxLat
##   499.21    57.04   710.41   164.82   870.69   356.72    75.90


Variances range from a minimum of 57 degrees squared for the latitude of origin feature to a maximum of 870.7 knots squared for the maximum intensity feature.

Because of the large range in variances you scale the features before performing cluster analysis. This is needed if your intent is to have each feature have the same influence on the clustering.

best.cs = scale(best.c)
m = attr(best.cs, "scaled:center")
s = attr(best.cs, "scaled:scale")


By default the function scale() centers and scales the columns of your numeric data frame. The center and scale values are saved as attributes in the new data frame. Here you save them to rescale the centroids after the cluster analysis.
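As a sketch of the rescaling (with toy numbers, not the book's data), any scaled value $$z$$ maps back to original units via $$z \times s + m$$, which is how the saved attributes are used later to rescale centroids:

```r
# Round-trip check: scale() a small data frame, then recover the
# original values from the saved center (m) and scale (s) attributes.
X = data.frame(a = c(10, 12, 14), b = c(-5, 0, 5))
Xs = scale(X)
m = attr(Xs, "scaled:center")
s = attr(Xs, "scaled:scale")
back = sweep(sweep(Xs, 2, s, "*"), 2, m, "+")   # z * s + m, column-wise
all.equal(as.numeric(back), as.numeric(as.matrix(X)))   # TRUE
```

The same sweep() idiom applied to a matrix of scaled centroids returns the centroids in the original feature units.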

You perform a $$k$$-means cluster analysis setting the number of clusters to three by typing

k = 3
ct = kmeans(best.cs, centers = k)
summary(ct)

##              Length Class  Mode
## cluster      667    -none- numeric
## centers       21    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric


The output is a list of length seven containing the cluster vector, cluster means (centroids), the total sum of squares, the within cluster sum of squares by cluster, the total within sum of squares, the between sum of squares, and the size of each cluster.

Your cluster means are in scaled units, so they are not readily interpretable. The cluster vector gives the membership of each hurricane in the order it appears in your data set. The ratio of the between-cluster sum of squares to the total sum of squares is 44.7%.

This ratio increases with the number of clusters, but at the expense of having clusters that are not physically interpretable. With four clusters, the increase in this ratio is smaller than the increase going from two to three clusters, so you are content with the three-cluster solution.
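The behavior of this ratio as the number of clusters grows can be sketched on toy data (random numbers, not the hurricane features):

```r
# Between-to-total sum-of-squares ratio as a function of k.
# The ratio typically rises with k, with diminishing gains, which is
# the basis for choosing a modest number of clusters.
set.seed(1)
X = matrix(rnorm(500), ncol = 5)   # 100 cases, 5 features
ratio = sapply(2:6, function(k) {
    fit = kmeans(X, centers = k, nstart = 10)   # nstart guards against poor starts
    fit$betweenss / fit$totss
})
round(ratio, 2)
```

Plotting this ratio (or the total within-cluster sum of squares) against k and looking for the bend is the familiar elbow heuristic.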

Track plots

Since six of the seven features are spatial coordinates, it is tempting to plot the centroids on a map and connect them with a track line. This would be a mistake, as your feature clusters have no direct geographic representation. Instead, you plot examples of cyclones that resemble each cluster.

First, you add cluster membership and distance to your original data frame then split the data by cluster member.

ctrs = ct$center[ct$cluster, ]
cln = ct$cluster
dist = rowMeans((best.cs - ctrs)^2)
id = 1:length(dist)
best.c = data.frame(best.c, id = id, dist = dist, cln = cln)
best.c.split = split(best.c, best.c$cln)


Next you subset your cluster data based on the tracks that come closest to the cluster centroids. This closeness is measured in the feature space, which includes latitude and longitude but also intensity. Here you choose nine tracks for each centroid.

te = 9
bestid = unlist(lapply(best.c.split, function(x) x$id[order(x$dist)[1:te]]))
cinfo = subset(best.c, id %in% bestid)


Finally you plot the tracks on a map. This requires a few steps to make the plot easier to read. Begin by setting the bounding box using latitude and longitude of your cluster data and renaming your objects.

cyclones = best.split
uselat = range(unlist(lapply(cyclones[cinfo$id], function(x) x$lat)))
uselon = range(unlist(lapply(cyclones[cinfo$id], function(x) x$lon)))


Next, order the clusters by number and distance and set the colors for plotting. Use the brewer.pal() function in the RColorBrewer package (Neuwirth 2011). You use a single hue sequential color ramp that allows you to highlight the cyclone tracks that are closest to the feature clusters.

cinfo = cinfo[order(cinfo$cln, cinfo$dist), ]
require(RColorBrewer)
blues = brewer.pal(te, "Blues")
greens = brewer.pal(te, "Greens")
reds = brewer.pal(te, "Reds")
cinfo$colors = c(rev(blues), rev(greens), rev(reds))

Next, reorder the tracks for plotting them as a weave, with the tracks farthest from the centroids plotted first.

cinfo = cinfo[order(-cinfo$dist, cinfo$cln), ]

Finally, plot the tracks and add world and country borders.

require(maps)
plot(uselon, uselat, type = "n", xaxt = "n", yaxt = "n", ylab = "", xlab = "")
map("world", add = TRUE)
map("usa", add = TRUE)
for (i in 1:nrow(cinfo)) {
    cid = cinfo$id[i]
    cyclone = cyclones[[cid]]
    n = length(cyclone$lon)
    lines(cyclone$lon, cyclone$lat, lwd = 2, col = cinfo$colors[i])
    arrows(cyclone$lon[n - 1], cyclone$lat[n - 1], cyclone$lon[n], cyclone$lat[n],
        lwd = 2, length = 0.1, col = cinfo$colors[i])
}


Track color is based on attribute proximity to the cluster centroid using a color saturation that decreases with distance.

The analysis splits the cyclones into a group over the Gulf of Mexico, a group at high latitudes, and a group that begins at low latitudes but ends at high latitudes generally east of the United States. Some of the cyclones that begin at high latitude are baroclinic.

Cluster membership can change depending on the initial random centroids, particularly for tracks that are farthest from a centroid. The two- and three-cluster solutions are the most stable. The approach can be extended to include other track features, including velocity, acceleration, and curvature, representing more of the space and time behavior of hurricanes.

Model-based clustering, where the observations are assumed to be quantified by a finite mixture of probability distributions, is an attractive alternative if you want to make inferences about future hurricane activity.
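As a minimal illustration of the finite-mixture idea (toy one-dimensional data in base R, not a substitute for a dedicated model-based clustering package), a two-component Gaussian mixture can be fit by expectation-maximization:

```r
# EM for a two-component Gaussian mixture: alternate between computing
# membership weights (E-step) and updating the mixture parameters (M-step).
set.seed(1)
x = c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 5, sd = 1))
mu = c(-1, 6)                  # crude starting values
p = 0.5
sd = c(1, 1)
for (it in 1:50) {
    d1 = p * dnorm(x, mu[1], sd[1])
    d2 = (1 - p) * dnorm(x, mu[2], sd[2])
    g = d1 / (d1 + d2)         # E-step: probability each point is in component 1
    p = mean(g)                # M-step: mixing proportion
    mu = c(weighted.mean(x, g), weighted.mean(x, 1 - g))
    sd = c(sqrt(weighted.mean((x - mu[1])^2, g)),
           sqrt(weighted.mean((x - mu[2])^2, 1 - g)))
}
round(c(p = p, mu = mu), 2)    # near p = 0.5 and means near 0 and 5
```

The fitted membership probabilities g play the role of a soft cluster assignment, and the mixture likelihood supports formal inference in a way that k-means does not.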

In the end it's worthwhile to keep in mind the advice of Bill Venables and Brian Ripley. You should not assume cluster analysis is the best way to discover interesting groups. Indeed, visualization methods are often more effective in this task.