\chapter{Cluster Models}
\label{chap:clustermodels}
\SweaveOpts{keep.source=TRUE, pdf=FALSE, prefix.string=Chap11, grdevice=tikz.Swd}
<>=
options(digits=3, show.signif.stars=FALSE, width=53)
rm(list=ls())
require(tikzDevice)
source("../SweaveTikZ.R")
@
%data files used: read.table("gridboxes.txt"); read.csv("bi.csv", T); load("annual.RData"); load("best.use.RData")
%required packages: maps, spatstat, sp, rgdal, maptools, cluster, xtable, RColorBrewer
%source code: source("correlationfuns.R")
\begin{quote}
``There are in fact two things, science and opinion; the former begets knowledge, the latter, ignorance.''
\end{quote}
\indent---Hippocrates\\

A cluster is a group of the same or similar events close together. Clusters arise in hurricane origin locations, tracks, and landfalls. In this chapter we look at how to analyze and model clusters. We divide the chapter into time, space, and feature clustering. Feature clustering is perhaps the best known among climatologists. We begin by showing you how to detect and model time clusters.

\section{Time Clusters}

Hurricanes form over certain regions of the ocean. Consecutive hurricanes from the same area often take similar paths. This grouping, or clustering, increases the potential for multiple landfalls above what you expect from random events. A statistical model for landfall probability captures clustering through covariates like the North Atlantic Oscillation (NAO), which relates a steering mechanism (position and strength of the subtropical High) to coastal hurricane activity. But there could be additional serial correlation not related to the covariates. A model that does not account for this extra variation will underestimate the potential for multiple hits in a season.

Following \cite{JaggerElsner2006} you consider three coastal regions including the Gulf Coast, Florida, and the East Coast (Fig.~\ref{chap06:fig:coastalregions}). Regions are large enough to capture a sufficient number of hurricanes, but not so large as to include many non-coastal strikes. Here you use hourly position and intensity data described in Chapter~\ref{chap:datasets}. For each tropical cyclone, you note its maximum wind speed within each region. If the maximum wind exceeds 33 m~s$^{-1}$ then you count it as a hurricane for the region. A tropical cyclone that affects more than one region at hurricane intensity is counted in each region. Because of this, the sum of the regional counts is larger than the total count.

Begin by loading {\bf annual.RData}. These data were assembled in Chapter~\ref{chap:datasets}. Subset the data for years starting with 1866.
<>=
load("annual.RData")
dat = subset(annual, Year >= 1866)
@
The covariate Southern Oscillation Index (SOI) data begin in 1866. Next, extract all hurricane counts for the Gulf coast, Florida, and East coast regions.
<>=
cts = dat[, c("G.1", "F.1", "E.1")]
@

\subsection{Cluster detection}

You start by comparing the observed with the expected number of years for two groups of hurricane counts: years with no hurricanes and years with three or more. The expected number is from a Poisson distribution with a constant rate. The idea is that for regions that show a cluster of hurricanes, the observed number of years with no hurricanes and years with three or more hurricanes should be greater than the corresponding expected number.
Said another way, a Poisson model with a hurricane rate estimated from counts over all years will, in regions with hurricane clustering, underestimate the number of years with no hurricanes and the number of years with many hurricanes. For example, you find the observed number of years without a Florida hurricane and the number of years with more than two hurricanes by typing
<>=
obs = table(cut(cts$F.1, breaks=c(-.5, .5, 2.5, Inf)))
obs
@
And the expected numbers for these three groups by typing
<>=
n = length(cts$F.1)
mu = mean(cts$F.1)
exp = n * diff(ppois(c(-Inf, 0, 2, Inf), lambda=mu))
exp
@
The observed and expected counts for the three regions are given in Table~\ref{tab:observedvsexpectedcounts}.
<>=
source("correlationfuns.R")
mu = colMeans(cts)
rt = regionTable(cts, mu)
@
In the Gulf and East coast regions the observed number of years is relatively close to the expected number of years in each of the groups. In contrast, in the Florida region the observed number of years exceeds the expected number of years for the no-hurricane and the three-or-more-hurricane groups.
\begin{table}
\caption{\label{tab:observedvsexpectedcounts} Observed and expected number of hurricane years by count groups.}
\begin{center}
\begin{tabular}{lcccc}
\hline
Region & O($= 0$) & E($= 0$) & O($\ge 3$) & E($\ge 3$) \\
\hline
Gulf Coast & 63 & 66.1 & 7 & 6.6 \\
Florida & 70 & 61.7 & 13 & 8.1 \\
East Coast & 74 & 72.3 & 7 & 4.9 \\
\hline
\end{tabular}
\end{center}
\end{table}

The difference between the observed and expected numbers in each region is used to assess the statistical significance of the clustering. This is done using the Pearson residuals and the $\chi^2$ statistic. The Pearson residual is the difference between the observed count and the expected rate divided by the square root of the variance. The $p$-value summarizes the evidence in support of the null hypothesis of no clustering, as indicated by no difference between the observed and expected numbers in each group. For example, to obtain the $\chi^2$ statistic, type
<>=
xis = sum((obs - exp)^2 / exp)
xis
@
The $p$-value as evidence in support of the null hypothesis is given by
<>=
pchisq(q=xis, df=2, lower.tail=FALSE)
@
where \verb@df@ is the degrees of freedom equal to the number of groups minus one. The $\chi^2$ and Pearson statistics for the three regions are shown in Table~\ref{tab:obsvsexpstats}.
\begin{table}
\caption{\label{tab:obsvsexpstats} Observed versus expected statistics. The Pearson and $\chi^2$ test statistics along with the corresponding $p$-values are given for each coastal region.}
\begin{center}
\begin{tabular}{lcccc}
\hline
Region & Pearson & $p$-value & $\chi^2$ & $p$-value \\
\hline
Gulf Coast & 135 & 0.6858 & 0.264 & 0.894 \\
Florida & 187 & 0.0092 & 6.475 & 0.039 \\
East Coast & 150 & 0.3440 & 1.172 & 0.557 \\
\hline
\end{tabular}
\end{center}
\end{table}
The $p$-values for the Gulf and East coasts are greater than .05 indicating little support for the cluster hypothesis. In contrast the $p$-value for the Florida region is \Sexpr{round(rt[2,8],3)} using the Pearson residuals and \Sexpr{round(rt[2,10],3)} using the $\chi^2$ statistic. These values provide evidence that hurricane occurrences in Florida are grouped in time.
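The Pearson entries in Table~\ref{tab:obsvsexpstats} are computed by the \verb@regionTable@ function. A minimal sketch of one way to reproduce the Florida entries, assuming the statistic is the sum of squared Pearson residuals over all years referred to a $\chi^2$ distribution with $n-1$ degrees of freedom, is
<>=
muF = mean(cts$F.1)  # constant annual rate for Florida
pearson = sum((cts$F.1 - muF)^2 / muF)
pearson
pchisq(q=pearson, df=length(cts$F.1) - 1, lower.tail=FALSE)
@
The value and $p$-value computed this way are consistent with the Florida row of the table.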
\subsection{Conditional counts}

What might be causing this grouping? The extra variation in annual hurricane counts might be due to variation in hurricane rates. You examine this possibility with a Poisson regression model (see Chapter~\ref{chap:frequencymodels}). The model includes an index for the North Atlantic Oscillation (NAO) and the Southern Oscillation (SOI) after \cite{ElsnerJagger2006}. This is a generalized linear model (GLM) approach using the Poisson family that includes a logarithmic link function for the rate. The annual hurricane count model is
\begin{eqnarray}
H_i &\sim& \mbox{dpois} (\lambda_i) \\ \nonumber
\log(\lambda_i) &=& \beta_0+\beta_{\mbox{soi}} \mbox{ SOI}_{i}+\beta_{\mbox{nao}} \mbox{ NAO}_{i} + \varepsilon_i
\end{eqnarray}
where $H_i$ is the hurricane count in year $i$ simulated ($\sim$) from a Poisson distribution with a rate $\lambda_i$ that depends on the year $i$. The logarithm of the rate depends in a linear way on the SOI and NAO covariates. The code to fit the model, determine the expected count, and tabulate the observed versus expected counts for each region is given by
<>=
pfts = list(G.1 = glm(G.1 ~ nao + soi, family="poisson",
              data=dat),
            F.1 = glm(F.1 ~ nao + soi, family="poisson",
              data=dat),
            E.1 = glm(E.1 ~ nao + soi, family="poisson",
              data=dat))
prsp = sapply(pfts, fitted, type="response")
rt = regionTable(cts, prsp, df=3)
@
The count model gives an expected number of hurricanes each year. The expected number is compared to the observed number as before. Results indicate that the clustering is somewhat ameliorated by conditioning the rates on the covariates. In particular, the Pearson residual reduces to \Sexpr{round(rt[2,7],1)} with an increase in the corresponding $p$-value to \Sexpr{round(rt[2,8],3)}. However, the $p$-value remains near .15 indicating the conditional model, while an improvement, fails to capture all the extra variation in Florida hurricane counts.

\subsection{Cluster model}

Having found evidence that Florida hurricanes arrive in clusters, you will model this process. In the simplest case you assume the following. Each cluster has either one or two hurricanes and the annual cluster counts follow a Poisson distribution with rate $r$. Note the difference. Earlier you assumed each hurricane was independent and annual hurricane counts followed a Poisson distribution. Further, you let $p$ be the probability that a cluster will have two hurricanes.

Formally your model can be expressed as follows. Let $N$ be the number of clusters in a given year and $X_i, i =1, \ldots, N$ be the number of hurricanes in each cluster minus one. Then the number of hurricanes in a given year is given by $H=N+\sum_{i=1}^N X_i$. Conditional on $N$, $M=\sum_{i=1}^N X_i$ has a binomial distribution since the $X_i$'s are independent Bernoulli variables and $p$ is constant. That is, $H=N+M$, where the annual number of clusters $N$ has a Poisson distribution with cluster rate $r$, and $M$ has a binomial distribution with proportion $p$ and size $N$. Here the binomial distribution describes the number of clusters, among the $N$ independent clusters in a year, that contain a second hurricane, with each cluster having probability $p$ of doing so.

In summary your cluster model has the following properties.
\begin{enumerate}
\item The expected number of hurricanes $E(H) = r(1+p)$.
\item The variance of $H$ is given by
\begin{eqnarray*}
\mbox{var}(H) &=& E(\mbox{var}(H|N))+ \mbox{var}(E(H|N)) \\
&=& E(N p(1-p))+\mbox{var}((1+p)N) \\
&=& rp(1-p) + r(1+p)^2 = r(1+3p)
\end{eqnarray*}
\item The dispersion of $H$ is given by $\mbox{var}(H)/E(H) = \phi = (1+3p)/(1+p)$, which is independent of the cluster rate. Solving for $p$ you find $p=(\phi-1)/(3-\phi)$.
\item The probability mass function for the number of hurricanes, $H$, is
\begin{eqnarray*}
P(H=k|r, p) &=& \sum_{i=0}^{\lfloor k/2\rfloor} \mbox{dpois}(k-i, r) \mbox{ dbinom}(i, k-i, p); \quad k=0, 1, \ldots \\
P(H=0|r, p) &=& e^{-r}
\end{eqnarray*}
where
\begin{eqnarray*}
\mbox{dpois}(k-i, r) &=& e^{-r}\frac{r^{k-i}}{(k-i)!} \\
\mbox{dbinom}(i, k-i, p) &=& {k-i \choose i} p^i(1-p)^{k-2i}
\end{eqnarray*}
\item The model has two parameters $r$ and $p$. A better parameterization uses $\lambda = r(1+p)$ together with $p$ to separate the hurricane frequency from the cluster probability. The parameters need not be fixed and can be functions of the covariates.
\item When $p=0$, $H$ is Poisson. When $p=1$, $H/2$ is Poisson, the dispersion is two, and the probability that $H$ is even is one.
\end{enumerate}
You need a way to estimate $r$ and $p$.

\subsection{Parameter estimation}

Your goal is a hurricane count distribution for the Florida region. For that you need an estimate of the annual cluster rate ($r$) and the probability ($p$) that the cluster size is two. Continuing with the GLM approach you separately estimate the annual hurricane frequency, $\lambda$, and the annual cluster rate $r$. The ratio of these two parameters minus one is an estimate of the probability $p$. This is reasonable if $p$ does not vary much, since the annual hurricane count variance is proportional to the expected hurricane count [i.e., $\mbox{var}(H) = r(1+3p) \propto r \propto E(H)$]. You estimated the parameters of the annual count model using Poisson regression, which assumes that the variance of the count is, in fact, proportional to the expected count. Thus under the assumption that $p$ is constant, Poisson regression can be used for estimating $\lambda$ in the cluster model.

As before, you use a logarithmic link function and regress the cluster rate onto the predictors NAO and SOI. The model is given by
\begin{eqnarray}
N_i &\sim& \mbox{dpois} (r_i) \\ \nonumber
\log(r_i) &=& \alpha_0+\alpha_{1} \mbox{ SOI}_{i}+\alpha_{2} \mbox{ NAO}_{i} + \varepsilon_i
\end{eqnarray}
The parameters of this annual cluster count model cannot be estimated directly, since the observed hurricane count does not furnish information about the number of clusters. Consider the observed set of annual Florida hurricane counts. Since the annual frequency is quite small, the majority of years have either no hurricanes or a single hurricane. You can create a `reduced' data set by using an indicator of whether or not there was at least one hurricane. Formally let $I_i=I(H_i > 0)=I(N_i > 0)$; then $I$ is an indicator of the occurrence of a hurricane cluster for each year. You assume $I$ has a binomial distribution with a size parameter of one and a proportion equal to $\pi$. This leads to a binomial regression model (see Chapter~\ref{chap:frequencymodels}) for $I$. Note that since $\exp(-r)$ is the probability of no clusters, the probability of a cluster $\pi$ is $1 - \exp(-r)$. Thus the cluster rate is $r = -\log(1-\pi)$. If you use a logarithmic link function on $r$, then $\log(r) = \log(-\log(1-\pi)) = \mbox{cloglog}(\pi)$, where cloglog is the complementary log-log function. Thus you model $I$ using the cloglog function to obtain $r$.
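As a preview of this strategy, here is a minimal sketch of the two regressions for Florida counts, assuming the data frame \verb@dat@ loaded earlier (this is not the book's \verb@testfits@ function, which is introduced below).
<>=
# Count model: Poisson regression with a log link.
cfit = glm(F.1 ~ nao + soi, family=poisson(link="log"),
  data=dat)
# Cluster-occurrence model: binomial regression with a cloglog
# link, so the linear predictor is log(r), the log cluster rate.
bfit = glm(I(F.1 > 0) ~ nao + soi,
  family=binomial(link="cloglog"), data=dat)
# Coefficient differences estimate the terms of log(1 + p).
coef(cfit) - coef(bfit)
@
The differences between the fitted coefficients estimate the terms of a regression for $\log(1+\hat p)$, as formalized next.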
Your cluster model is thus a combination of two models, one for the counts and another for the clusters. Formally, it is given by
\begin{eqnarray}
I_i &\sim& \mbox{dbern} (\pi_i) \\ \nonumber
\mbox{cloglog}(\pi_i) &=& \alpha_0+\alpha_1 \mbox{ SOI}_{i}+ \alpha_2 \mbox{ NAO}_{i} + \varepsilon_i
\end{eqnarray}
where \verb@dbern@ is the Bernoulli distribution with mean $\pi$. The covariates are the same as those used in the cluster count model. Given these equations you have the following relationships for $r$ and $p$.
\begin{eqnarray}
\log(\hat r (1+\hat p)) &=& \hat \beta_0 + \hat \beta_1 \mbox{ SOI} + \hat \beta_2 \mbox{ NAO} \label{eq:count} \\
\log(\hat r) &=& \hat \alpha_0 + \hat \alpha_1 \mbox{ SOI} + \hat \alpha_2 \mbox{ NAO} \label{eq:randclust}
\end{eqnarray}
By subtracting the coefficients in Eq.~\ref{eq:randclust} for the annual cluster count model from those in Eq.~\ref{eq:count} for the annual hurricane count model, you have a regression model for the probabilities given by
\begin{equation}
\label{eq:randp}
\log(1+\hat p)=\hat \beta_0 - \hat \alpha_0 +(\hat \beta_1 - \hat \alpha_1) \mbox{ SOI}+ (\hat \beta_2 - \hat \alpha_2) \mbox{ NAO}
\end{equation}

\subsection{Model diagnostics}

You start by comparing fitted values from the count model with fitted values from the cluster model. Let $H_i$ be the hurricane count in year $i$ and $\hat \lambda_i$ and $\hat r_i$ be the fitted annual count and cluster rates, respectively. Then let $\tau_0$ be a test statistic given by
\begin{equation}
\tau_0=\frac{1}{n}\sum_{i=1}^n (H_i-\hat{r_i}) = \frac{1}{n}\sum_{i=1}^n (\hat{\lambda}_i-\hat{r_i}).
\end{equation}
The value of $\tau_0$ is greater than zero if there is clustering. You test the significance of $\tau_0$ by generating random samples of length $n$ from a Poisson distribution with rate $\lambda_i$ and computing $\tau_j$ for $j =1, \ldots, N$, where $N$ is the number of samples. The $p$-value for a test of the null hypothesis that $\tau_0 \le 0$ is the proportion of simulated $\tau$'s that are at least as large as $\tau_0$. You do this with the \verb@testfits@ function in the {\bf correlationfuns.R} source file by specifying the model formula, data, and number of random samples.
<>=
source("correlationfuns.R")
tfF = testfits(F.1 ~ nao + soi, data=dat, N=1000)
tfF$test; tfF$testpvalue
@
For Florida hurricanes the test statistic $\tau_0$ has a value of \Sexpr{round(tfF$test,3)} indicating some difference in count and cluster rates. The proportion of 1000 simulated $\tau$'s that are at least as large as this is \Sexpr{round(tfF$testpvalue,3)} providing sufficient evidence to reject the no-cluster hypothesis. Repeating the simulation using Gulf Coast hurricanes
<>=
tfG = testfits(G.1 ~ nao + soi, data=dat, N=1000)
tfG$test; tfG$testpvalue
@
you find that, in contrast to Florida, there is little evidence against the no-cluster hypothesis.
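The source of \verb@testfits@ is not listed in the text. A minimal sketch of the Monte Carlo test it performs, under the assumptions described above (a sketch only, not its actual implementation), is
<>=
pfit = glm(F.1 ~ nao + soi, family=poisson, data=dat)
bfit = glm(I(F.1 > 0) ~ nao + soi,
  family=binomial(link="cloglog"), data=dat)
lam = fitted(pfit)  # fitted annual count rates
tau0 = mean(lam - (-log(1 - fitted(bfit))))
tausim = replicate(200, {
  # counts simulated under the no-cluster (Poisson) null
  Hs = rpois(length(lam), lambda=lam)
  pf = glm(Hs ~ nao + soi, family=poisson, data=dat)
  bf = glm(I(Hs > 0) ~ nao + soi,
    family=binomial(link="cloglog"), data=dat)
  mean(fitted(pf) - (-log(1 - fitted(bf))))
})
mean(tausim >= tau0)  # one-sided p-value
@
Each simulated sample is refit with both models so the reference distribution of $\tau$ reflects estimation noise under the no-cluster null.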
A linear regression through the origin of the fitted count rate on the cluster rate under the assumption that $p$ is constant yields an estimate for $1+p$. You plot the annual count and cluster rates and draw the regression line using the \verb@plotfits@ function.
<>=
par(mfrow=c(1, 2), pty="s")
ptfF = plotfits(tfF)
mtext("a", side=3, line=1, adj=0, cex=1.1)
ptfG = plotfits(tfG)
mtext("b", side=3, line=1, adj=0, cex=1.1)
@
\begin{figure}
\centering
<>=
par(mfrow=c(1, 2), pty="s", mgp=c(2, .4, 0), tcl=-.3)
ptfF = plotfits(tfF)
mtext("a", side=3, line=1, adj=0, cex=1.1)
ptfG = plotfits(tfG)
mtext("b", side=3, line=1, adj=0, cex=1.1)
@
\vspace{-1cm}
\caption{Count versus cluster rates for (a) Florida and (b) Gulf coast.}
\label{chap11:fig:countvsclusterrates}
\end{figure}
The regressions are shown in Fig.~\ref{chap11:fig:countvsclusterrates} for Florida and the Gulf coast. The black line is the $y=x$ line and you expect the cluster and hurricane rates to align along this line if there is no clustering. The red line is the regression of the fitted hurricane rate onto the fitted cluster rate with the intercept set to zero. The slope of the line is an estimate of $1+p$. The regression slopes are printed by typing
<>=
coefficients(ptfF); coefficients(ptfG)
@
The slope is \Sexpr{round(as.numeric(coefficients(ptfF)),2)} for the Florida region giving \Sexpr{round(as.numeric(coefficients(ptfF))-1,2)} as an estimate for $p$ (the probability that a cluster will have two hurricanes). The regression slope is \Sexpr{round(as.numeric(coefficients(ptfG)),2)} for the Gulf coast region, which you interpret as a lack of evidence for hurricane clusters.

Your focus is now on Florida hurricanes only. You continue by looking at the coefficients from both models. Type
<>=
summary(tfF$fits$poisson)$coef
summary(tfF$fits$binomial)$coef
@
<>=
require(xtable)
tbl = xtable(summary(tfF$fits$poisson)$coef,
  label='tab:coefcountmodel',
  caption='Coefficients of the count rate model.')
print(tbl, math.style.negative=TRUE, caption.placement="top")
@
<>=
tbl = xtable(summary(tfF$fits$binomial)$coef,
  label='tab:coefclustermodel',
  caption="Coefficients of the cluster rate model.")
print(tbl, math.style.negative=TRUE, caption.placement="top")
@
The output coefficient tables (Tables~\ref{tab:coefcountmodel} and \ref{tab:coefclustermodel}) show that the NAO and SOI covariates are significant in the hurricane count model, but only the NAO is significant in the hurricane cluster model. The difference in coefficient values from the two models is an estimate of $\log(1+p)$, where again $p$ is the probability that a cluster will have two hurricanes. The difference in the NAO coefficient is \Sexpr{round(summary(tfF$fits$poisson)$coef[2,1]-summary(tfF$fits$binomial)$coef[2,1],3)} and the difference in the SOI coefficient is \Sexpr{round(summary(tfF$fits$poisson)$coef[3,1]-summary(tfF$fits$binomial)$coef[3,1],3)} indicating the NAO increases the probability of clustering more than ENSO. Lower values of the NAO lead to a larger rate increase for the Poisson model relative to the binomial model.

\subsection{Forecasts}

It is interesting to compare forecasts of the distribution of Florida hurricanes using a Poisson model and your cluster model. Here you set $p=.138$ for the cluster model. You can use the same two-component formulation for your Poisson model by setting $p=0$. You prepare your data using the \verb@lambdapclust@ function as follows.
<>=
ctsF = cts[, "F.1", drop=FALSE]
pars = lambdapclust(prsp[, "F.1", drop=FALSE], p=.138)
ny = nrow(ctsF)
h = 0:5
@
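The \verb@dclust@ function used next comes from {\bf correlationfuns.R}. Its source is not listed here, but a minimal version consistent with the probability mass function derived earlier might look like the following sketch (the values of $r$ and $p$ are illustrative).
<>=
dclust1 = function(k, r, p){
  # P(H = k) sums over i, the number of two-hurricane clusters
  i = 0:floor(k / 2)
  sum(dpois(k - i, lambda=r) * dbinom(i, size=k - i, prob=p))
}
sum(sapply(0:20, dclust1, r=.6, p=.138))  # probabilities sum to ~1
@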
Next you compute the expected number of years with \verb@h@ hurricanes from the cluster and Poisson models and tabulate the observed number of years. You combine them in a data object.
<>=
eCl = sapply(h, function(x) sum(do.call('dclust',
  c(x=list(rep(x, ny)), pars))))
ePo = sapply(0:5, function(x) sum(dpois(x=rep(x, ny),
  lambda=prsp[, "F.1"])))
o = as.numeric(table(ctsF))
dat = rbind(o, eCl, ePo)
colnames(dat) = 0:5
@
Finally you plot the observed versus the expected from the cluster and Poisson models using a bar plot where the bars are plotted side-by-side.
<>=
barplot(dat, ylab="Number of Years",
  xlab="Number of Florida Hurricanes",
  names.arg=c(0:5), col=c("black", "red", "blue"),
  legend=c("Observed", "Cluster", "Poisson"),
  beside=TRUE)
@
\begin{figure}
\centering
<>=
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
barplot(dat, ylab="Number of years",
  xlab="Number of Florida hurricanes",
  names.arg=c(0:5), col=c("black", "red", "blue"),
  legend=c("Observed", "Cluster", "Poisson"),
  beside=TRUE)
@
\vspace{-.7cm}
\caption{Observed versus expected number of Florida hurricane years.}
\label{chap11:fig:sidebysidebarplot}
\end{figure}
Results are shown in Fig.~\ref{chap11:fig:sidebysidebarplot}. The expected numbers are based on the cluster model ($p$ = 0.138) and on the Poisson model ($p$ = 0). The cluster model fits the observed counts better than does the Poisson model, particularly in the low- and high-count years. Florida had hurricanes in only two of the 11 years from 2000 through 2010. But these two years featured seven hurricanes.

Seasonal forecast models that predict U.S. hurricane activity assume a Poisson distribution. You show here that this assumption applied to Florida hurricanes leads to a forecast that underpredicts both the number of years without hurricanes and the number of years with three or more hurricanes \citep{JaggerElsner2012}. The lack of fit in the forecast model arises due to clustering of hurricanes along this part of the coast. You demonstrate a temporal cluster model that assumes the rate of hurricane clusters follows a Poisson distribution with the size of the cluster limited to two hurricanes. The model fits the distribution of Florida hurricanes better than a Poisson model when both are conditioned on the NAO and SOI.

\section{Spatial Clusters}

Is there a tendency for hurricanes to cluster in space? More specifically, given that a hurricane originates at a particular location is it more (or less) likely that another hurricane will form in the same vicinity? This is a problem for spatial point pattern analysis. Here you consider methods for analyzing and modeling point events in space.

We begin with some definitions. An {\it event} is the occurrence of some phenomenon of interest. For example, an event can be a hurricane's origin or its lifetime maximum intensity. An {\it event location} is the spatial location of the event. For example, the latitude and longitude of the maximum intensity event. A {\it point} is any location where an event could occur. A {\it spatial point pattern} is a collection of events and event locations together with a spatial domain. The {\it spatial domain} is defined as the region of interest over which events tend to occur. For example, the North Atlantic basin is the spatial domain for hurricanes.

To define spatial clustering, it helps to define its absence. A spatial point pattern is defined as random if an event is equally likely to occur at any point within the spatial domain. A spatial point pattern is said to be clustered if given an event at some location it is more likely than random that another event will occur nearby.
Regularity is the opposite; given an event, if it is less likely than random that another event will occur nearby, then the spatial point pattern is regular. Said another way, complete spatial randomness (CSR) defines a situation where an event is equally likely to occur at any location within the study area regardless of the locations of other events. A {\it realization} is a single spatial point pattern generated under a spatial point process model.

To illustrate, consider four point patterns each consisting of events inside the unit square. You generate the event locations and plot them by typing
<>=
par(mfrow=c(2, 2), mex=.9, pty="s")
for(i in 1:4){
  u = runif(n=30, min=0, max=1)
  v = runif(n=30, min=0, max=1)
  plot(u, v, pch=19, xlim=c(0, 1), ylim=c(0, 1))
}
@
\begin{figure}
\centering
<>=
par(mfrow=c(2, 2), mex=.5)
for(i in 1:4){
  u = runif(n=30, min=0, max=1)
  v = runif(n=30, min=0, max=1)
  plot(u, v, pch=19, xaxt="n", yaxt="n", xlab="", ylab="",
    xlim=c(0, 1), ylim=c(0, 1))
}
@
\vspace{-.7cm}
\caption{Point patterns exhibiting complete spatial randomness.}
\label{chap11:fig:csrexamples}
\end{figure}
The patterns of events in Fig.~\ref{chap11:fig:csrexamples} illustrate that some amount of apparent clustering occurs by chance.

The {\bf spatstat} package \citep{BaddeleyTurner2005} contains a large number of functions for analyzing and modeling point pattern data. To make the functions available and obtain a citation type
<>=
require(spatstat)
citation(package="spatstat")
@
Complete spatial randomness lies on a continuum between clustered and regular spatial point patterns. You use simulation functions in {\bf spatstat} to compare the CSR plots in Fig.~\ref{chap11:fig:csrexamples} with regular and cluster patterns. Examples are shown in Fig.~\ref{chap11:fig:clusterexamples}.
\begin{figure}
\centering
<>=
par(mfrow=c(2, 2), mex=.5)
plot(rMaternI(kappa=80, r=.05), pch=19, main="")
plot(rMaternI(kappa=80, r=.05), pch=19, main="")
plot(rMatClust(kappa=20, r=.15, mu=4), pch=19, main="")
plot(rMatClust(kappa=20, r=.15, mu=4), pch=19, main="")
@
\vspace{-.7cm}
\caption{Regular (top) and clustered (bottom) point patterns.}
\label{chap11:fig:clusterexamples}
\end{figure}
Here you can see two realizations of patterns more regular than CSR (top row), and two of patterns more clustered than CSR (bottom row). The simulations are made using a spatial point pattern model. Spatial scale plays a role. A set of events can indicate a regular spatial pattern on one scale but a clustered pattern on another.

\subsection{Point processes}

Given a set of spatial events (a spatial point pattern) your goal is to assess evidence for clustering (or regularity). Spatial cluster detection methods are based on statistical models for spatial point processes. The random variables in these models are the event locations. Here we provide some definitions useful for understanding point process models. Details on the underlying theory are available in \cite{Ripley1981}, \cite{Cressie1993}, and \cite{Diggle2003}.

A process generating events is {\it stationary} when it is invariant to translation across the spatial domain. This means that the relationship between two events depends only on their positions relative to one another and not on the event locations themselves. This relative position refers to the (lag) distance and orientation between the events. A process is {\it isotropic} when orientation does not matter. Recall that these same concepts applied to the variogram models used in Chapter~\ref{chap:spatialmodels}.
Given a single realization, the assumptions of stationarity and isotropy allow for replication. Two pairs of events in the same realization of a stationary process that are separated by the same distance should have the same relatedness. This allows you to use relatedness information from one part of the domain as a replicate for relatedness from another part of the domain. The assumptions of stationarity and isotropy let you begin and can be relaxed later.

The Poisson distribution is used to define a model for CSR. A spatial point process is said to be {\it homogeneous} Poisson under the following two criteria:
\begin{itemize}
\item The number of events occurring within a finite region $A$ is a random variable following a Poisson distribution with mean $\lambda \times |A|$ for some positive constant $\lambda$, where $|A|$ denotes the area of $A$.
\item Given the total number of events $N$, their locations are an independent random sample of points where each point is equally likely to be picked as an event location.
\end{itemize}
The first criterion is about the density of the spatial process. For a given domain it answers the question, how many events? The density is the number of events divided by the domain area and is an estimate of the rate parameter of the Poisson distribution.\footnote{In spatial statistics this is often called the {\it intensity}. Here we use the term {\it density} instead so as not to confuse the spatial rate parameter with hurricane strength.} The second criterion is about homogeneity. Events are scattered throughout the domain and are not clustered or regular.

It helps to consider how to create a homogeneous Poisson point pattern. The procedure follows straight from the definition. Step 1: generate the number of events using a Poisson distribution with mean equal to $\lambda$. Step 2: place the events inside the domain using a uniform distribution for both spatial coordinates. For example, letting the area of the domain be one and the density of events be 200, you type
<>=
lambda = 200
N = rpois(1, lambda)
x = runif(N); y = runif(N)
plot(x, y, pch=19)
@
Note that if your domain is not a rectangle, you can circumscribe it within a rectangle, then reject events sampled from inside the rectangle that fall outside the domain. This is called `rejection sampling.'

By definition these point patterns are CSR. However, as noted above you are typically in the opposite position. You observe a set of events and want to know if they are regular or clustered. Your null hypothesis is CSR and you need a test statistic that will guide your inference. The above demonstrates that null models are easy to construct, so you can use Monte Carlo methods.

In some cases the homogeneous Poisson model is too restrictive. Consider hurricane genesis as an event process. Event locations may be random but the probability of an event is higher away from the equator. The constant risk hypothesis undergirding the homogeneous Poisson point pattern model must then be generalized to include a spatially varying density function. To do this you define the density $\lambda(s)$, where $s$ denotes spatial points. This is called an {\it inhomogeneous} Poisson model and it is analogous to the universal kriging model used on field data (see Chapter~\ref{chap:spatialmodels}). Inhomogeneity as defined by a spatially varying density implies non-stationarity, as the expected number of events depends on location.
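One way to simulate from an inhomogeneous Poisson model is by thinning: generate a homogeneous pattern at the maximum density, then keep each event with probability equal to the ratio of the local density to that maximum. A minimal sketch on the unit square, with a hypothetical density that increases away from the `equator' at $y=0$, is
<>=
lamfun = function(x, y) 400 * y  # hypothetical density surface
lammax = 400                     # its maximum over the domain
N = rpois(1, lammax)
x = runif(N); y = runif(N)
keep = runif(N) < lamfun(x, y) / lammax  # thinning step
plot(x[keep], y[keep], pch=19)
@
The \verb@rpoispp@ function in {\bf spatstat} accepts a density function directly and automates this.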
\subsection{Spatial density}

Density is a first-order property of a spatial point pattern. That is, the density function estimates the mean number of events at any point in your domain. Events are independent of one another, but event clusters appear because of the varying density. Given an observed set of events, how do you estimate $\lambda(s)$? One approach is to use kernel densities.

Consider again the set of hurricanes over the period 1944--2000 that were designated tropical only and baroclinically enhanced (Chapter~\ref{chap:frequencymodels}). Input these data and create a spatial points data frame of the genesis locations by typing
<>=
bh = read.csv("bi.csv", header=TRUE)
require(sp)
coordinates(bh) = c("FirstLon", "FirstLat")
ll = "+proj=longlat +ellps=WGS84"
proj4string(bh) = CRS(ll)
@
Next convert the geographic coordinates using the Lambert conformal conic projection true at parallels 10 and 40$^\circ$N and a center longitude of 60$^\circ$W. First save the reference system as a CRS object then use the \verb@spTransform@ function from the {\bf rgdal} package.
<>=
lcc = "+proj=lcc +lat_1=40 +lat_2=10 +lon_0=-60"
require(rgdal)
bht = spTransform(bh, CRS(lcc))
@
The spatial distance unit is meters. Use the \verb@map@ function ({\bf maps}) and the \verb@map2SpatialLines@ function ({\bf maptools}) to obtain country borders by typing
<>=
require(maps)
require(maptools)
brd = map("world", xlim=c(-100, -30), ylim=c(10, 48),
  plot=FALSE)
brd_ll = map2SpatialLines(brd, proj4string=CRS(ll))
brd_lcc = spTransform(brd_ll, CRS(lcc))
@
You use the same coordinate transformation on the map borders as you do on the cyclone locations. Next you need to convert your S4 class objects (\verb@bht@ and \verb@brd_lcc@) into S3 class objects for use with the functions in the {\bf spatstat} package.
<>=
require(spatstat)
bhp = as.ppp(bht)
clpp = as.psp(brd_lcc)
@
The spatial point pattern object \verb@bhp@ contains marks. A mark is attribute information at each event location. The marks are the columns from the original data that were not used for location. You are interested only in the hurricane type (either tropical only or baroclinic) so you reset the marks accordingly by typing
<>=
marks(bhp) = bht$Type == 0
@
You summarize the object with the \verb@summary@ method. The summary includes an average density over the spatial domain. The density is per unit area. Your native length unit is the meter from the Lambert planar projection so your density is per square meter. You retrieve the average density in units of per (1000~km)$^2$ by typing
<>=
summary(bhp)$intensity * 1e+12
@
Thus, on average, each point in your spatial domain has slightly more than ten hurricane origins per (1000~km)$^2$. This is the mean spatial density. Some areas have more or less than the mean so you would like an estimate of $\lambda(s)$. You do this using the \verb@density@ method, which computes a kernel smoothed density function from your point pattern object. By default the kernel is Gaussian with a bandwidth given by the standard deviation of the isotropic smoothing kernel. The default output is a pixel image containing local density estimates. Here again you convert the density values to units of events per (1000~km)$^2$.
<>=
den = density(bhp)
den$v = den$v * 1e+12
@
You use the \verb@plot@ method first to plot the density image, then the country border, and finally the event locations.
<>=
plot(den)
plot(unmark(clpp), add=TRUE)
plot(unmark(bhp), pch=19, cex=.3, add=TRUE)
@
Event density maps split by tropical-only and baroclinic hurricane genesis are shown in Fig.~\ref{chap11:fig:densitymaps}.
Here we converted the {\it im} object to a {\it SpatialGridDataFrame} object and used the \verb@spplot@ method. Regions of the central Atlantic extending westward through the Caribbean into the southern Gulf of Mexico have the greatest tropical-only density generally exceeding ten hurricane origins per (1000~km)$^2$. In contrast, regions off the eastern coast of the United States extending eastward to Bermuda have the greatest baroclinic density. The amount of smoothing is determined to a large extent by the bandwidth and to a much smaller extent by the type of kernel.
\begin{figure}
\centering
<>=
to = bhp[bhp$marks == TRUE]
be = bhp[bhp$marks == FALSE]
den1 = density(to)
den1$v = den1$v * 1e+12
den.sgdf = as(den1, "SpatialGridDataFrame")
den2 = density(be)
den2$v = den2$v * 1e+12
den.sgdf$v2 = as(den2, "SpatialGridDataFrame")$v
l3 = list("sp.lines", brd_lcc, lwd=.3, first=FALSE)
p1 = spplot(den.sgdf, "v", col.regions=rev(topo.colors(20)),
  sp.layout=l3, at=seq(0, 30, 5),
  colorkey=list(space="bottom", labels=list(cex=.7)),
  sub=list("Density [Events/(1000 km)$^2$]", cex=.7, font=1))
p2 = spplot(den.sgdf, "v2", col.regions=rev(topo.colors(20)),
  sp.layout=l3, at=seq(0, 30, 5),
  colorkey=list(space="bottom", labels=list(cex=.7)),
  sub=list("Density [Events/(1000 km)$^2$]", cex=.7, font=1))
require(grid)
p1 = update(p1, main=textGrob("a", x=unit(.05, "npc")))
p2 = update(p2, main=textGrob("b", x=unit(.05, "npc")))
plot(p1, split=c(1, 1, 2, 1), more=TRUE)
plot(p2, split=c(2, 1, 2, 1), more=FALSE)
@
\vspace{-1.5cm}
\caption{Genesis density for (a) tropical-only and (b) baroclinic hurricanes.}
\label{chap11:fig:densitymaps}
\end{figure}
Densities at the genesis locations are obtained using the argument \verb@at="points"@ in the \verb@density@ call. Also, the number of events in grid boxes across the spatial domain can be obtained using the \verb@quadratcount@ or \verb@pixellate@ functions. For example, type
<>=
plot(quadratcount(bhp))
plot(pixellate(bhp, dimyx=5))
@

\subsection{Second-order properties}

The density function above describes the rate (mean) of hurricane genesis events locally. Second-order functions describe the variability of events. Ripley's K function is an example. It is defined as
\begin{equation}
\hat{K}(s) = \hat{\lambda}^{-1}n^{-1}\sum_{i\ne j} I(d_{ij} < s)
\end{equation}
where $d_{ij}$ is the distance between events $i$ and $j$, $I$ is the indicator function equal to one when $d_{ij} < s$ and zero otherwise, and $\hat{\lambda}$ is the estimated average density. Under CSR the expected value of $K(s)$ is $\pi s^2$, so clustering at lag distance $s$ is indicated by $\hat{K}(s) > \pi s^2$. The function \verb@Kest@ is used to compute $\hat{K}(s)$. Here you save the results by typing
<>=
k = Kest(bhp)
@
The function takes an object of class \verb@ppp@ and computes $\hat{K}(s)$ using different formulas that correct for edge effects to reduce bias arising from counting hurricanes near the borders of your spatial domain \cite[]{Ripley1991, BaddeleyEtAl2000}. The K function is defined such that $\lambda \hat{K}(s)$ equals the expected number of additional hurricanes within a distance $s$ of any other hurricane. You plot the expected number as a function of separation distance by typing
<>=
lam = summary(bhp)$intensity
m2km = .001
plot(k$r * m2km, k$iso * lam, type="l", lwd=2,
  xlab="Lag Distance (s) [km]",
  ylab="Avg Number of Nearby Hurricanes")
@
You first save the value of $\lambda$ and a conversion factor to go from meters to kilometers. The \verb@iso@ column in the results object refers to the K function computed using the isotropic correction for regular polygon domains. The \verb@Kest@ function also returns theoretical values under the hypothesis of CSR. You add these to your plot by typing
<>=
lines(k$r * m2km, k$theo * lam, col="red", lwd=2)
@
The empirical curve lies above the theoretical curve at all lag distances.
For instance at a lag distance of \Sexpr{round(k$r[247]*m2km,0)}~km, there are on average about \Sexpr{round(k$iso[247]*lam,0)} additional hurricanes nearby. This compares with an expected number of \Sexpr{round(k$theo[247]*lam,0)} additional hurricanes. This indicates that for a given random hurricane, there are more nearby hurricanes than you would expect by chance under the assumption of CSR. This is indicative of clustering. But as you saw above, the clustering is related to the inhomogeneous distribution of hurricane events across the basin. So this result is not particularly useful. Instead you use the \verb@Kinhom@ function to compute a generalization of the K function for inhomogeneous point patterns \citep{BaddeleyEtAl2000}. The results are shown in Fig.~\ref{chap11:fig:kfunctionplot}. Black curves are the empirical estimates and red curves are the theoretical. The empirical curve is much closer to the inhomogeneous theoretical curve.
\begin{figure}
\centering
<>=
m2km = .001
den = density(bhp)
k = Kest(bhp)
k2 = Kinhom(bhp, lambda=den)
#e = envelope(bhp, Kinhom, nsim=99, verbose=FALSE)
lam = summary(bhp)$intensity
par(mfrow=c(1, 2), las=1, mgp=c(2, .4, 0), tcl=-.3)
plot(k$r * m2km, k$iso * lam, type="l", lwd=2, ylim=c(0, 55),
  xlab="Lag distance (s) [km]",
  ylab="Avg number of nearby hurricanes")
grid()
lines(k$r * m2km, k$iso * lam)
lines(k$r * m2km, k$theo * lam, col="red", lwd=2)
mtext("a", side=3, line=1, adj=0, cex=1.1)
plot(k2$r * m2km, k2$iso * lam, type="l", lwd=2, ylim=c(0, 55),
  xlab="Lag distance (s) [km]",
  ylab="Avg number of nearby hurricanes")
grid()
#xx = c(e$r * m2km, rev(e$r * m2km))
#yy = c(e$lo * lam, rev(e$hi * lam))
#polygon(xx, yy, border=NA, col="gray")
lines(k2$r * m2km, k2$iso * lam, lwd=2)
lines(k2$r * m2km, k2$theo * lam, col="red", lwd=2)
mtext("b", side=3, line=1, adj=0, cex=1.1)
@
\vspace{-.7cm}
\caption{2nd-order genesis statistics. (a) Ripley's K and (b) generalization of K.}
\label{chap11:fig:kfunctionplot}
\end{figure}
There appears to be some clustering of hurricane genesis at short distances and regularity at larger distances. The regularity at large distance is likely due to land. The analysis can be improved by masking land areas rather than using a rectangular window.

\subsection{Models}

You can fit a parametric model to your point pattern. This allows you to predict the occurrence of hurricane events given covariates and the historical spatial patterns including clustering and regularity. The scope of such models is quite wide, and the {\bf spatstat} package implements many of them through the \verb@ppm@ function. Models can include spatial trend, dependence on covariates, and event interactions of any order (you are not restricted to pairwise interactions). Models are fit using the method of maximum pseudo-likelihood, or using an approximate maximum likelihood method. Model syntax is standard R. Typically the null model is homogeneous Poisson.
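As an illustration only (the trend formula here is an arbitrary choice, not an analysis from the text), you might compare a homogeneous Poisson model for the genesis events against an inhomogeneous alternative with a log-quadratic spatial trend.
<>=
fit0 = ppm(unmark(bhp), ~1)  # homogeneous Poisson null model
fit1 = ppm(unmark(bhp), ~polynom(x, y, 2))  # log-quadratic trend
anova(fit0, fit1, test="Chi")  # likelihood ratio comparison
@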
\section{Feature Clusters}

In the first two sections of this chapter you examined clustering in time and geographic space. It is also possible to cluster in feature space. Indeed, this is the best known of the cluster methods. Feature clustering is called cluster analysis. Cluster analysis searches for groups in feature (attribute) data. Objects belonging to the same cluster have similar features. Objects belonging to different clusters have dissimilar features. In two or three dimensions, clusters can be visualized. In more than three dimensions analytic assistance is helpful.

Note the difference with the two previous cluster methods. With those methods your initial goal was cluster detection, with the ultimate goal of using the clusters to improve prediction. Here your goal is descriptive and cluster analysis is a data-reduction technique. You start with the assumption that your data can be grouped based on feature similarity.

Cluster analysis methods fall into two distinct categories: partition and hierarchical. Partition clustering divides your data into a pre-specified number of groups. To help you decide on the number of clusters you often have to try several different numbers. An index of cluster quality can help you choose the `best' number. Hierarchical clustering creates increasing or decreasing sets of clustered groups from your data. Agglomerative hierarchical clustering starts with each object forming its own individual cluster. It then successively merges clusters until a single large cluster remains (your entire data). Divisive hierarchical clustering is just the opposite. It starts with a single super cluster that includes all objects and proceeds to successively split the clusters into smaller groups. To cluster in feature space you need to:
\begin{enumerate}
\item Create a dissimilarity matrix from a set of objects, and
\item Group the objects based on the values of the dissimilarity matrix.
\end{enumerate}

\subsection{Dissimilarity and distance}

The dissimilarity between two objects measures how different they are. The larger the dissimilarity the greater the difference. Objects are considered vectors in attribute (feature) space, where vector length equals the number of data set attributes. Consider for instance a data set with three attributes and four objects (Table~\ref{tab:attributesandobjects}).
\begin{table}
\caption{\label{tab:attributesandobjects} Attributes and objects. A data set ready for cluster analysis.}
\begin{center}
\begin{tabular}{lccc}
\hline
& Feature 1 & Feature 2 & Feature 3 \\
\hline
Object 1 & $x_{1,1}$ & $x_{1,2}$ & $x_{1,3}$ \\
Object 2 & $x_{2,1}$ & $x_{2,2}$ & $x_{2,3}$ \\
Object 3 & $x_{3,1}$ & $x_{3,2}$ & $x_{3,3}$ \\
Object 4 & $x_{4,1}$ & $x_{4,2}$ & $x_{4,3}$ \\
\hline
\end{tabular}
\end{center}
\end{table}
Object 1 is a vector consisting of the triplet ($x_{1,1}$, $x_{1,2}$, $x_{1,3}$), object 2 is a vector consisting of the triplet ($x_{2,1}$, $x_{2,2}$, $x_{2,3}$), and so on. Then the elements of a dissimilarity matrix are pairwise distances between the objects in feature space. As an example, an object might be a hurricane track with features that include genesis location, lifetime maximum intensity, and lifetime maximum intensification.

Although distance is an actual metric, the dissimilarity function need not be. The minimum requirements for a dissimilarity measure $d$ are:
\begin{itemize}
\item $d_{i,i} = 0$
\item $d_{i,j} \ge 0$
\item $d_{i,j} = d_{j,i}$.
\end{itemize}
The following axioms of a proper metric do not need to be satisfied:
\begin{itemize}
\item $d_{i,k} \leq d_{i,j} + d_{j,k}$ (triangle inequality)
\item $d_{i,j}=0$ implies $i=j$.
\end{itemize}
Before clustering you need to arrange your data set as an $n \times p$ data matrix, where $n$ is the number of rows (one for each object) and $p$ is the number of columns (one for each attribute variable). How you compute dissimilarity depends on your attribute variables.
If the variables are numeric you use the Euclidean or Manhattan distance given as
\begin{eqnarray}
d^E_{i,j} &=& \sqrt{\sum_{f=1}^p (x_{i,f}-x_{j,f})^2} \\
d^M_{i,j} &=& \sum_{f=1}^p |x_{i,f}-x_{j,f}|
\end{eqnarray}
The summation is over the number of attributes (features). With hierarchical clustering the maximum distance norm, given by
\begin{equation}
d^{\hbox{max}}_{i,j} = \hbox{max}(|x_{i,f}-x_{j,f}|, f=1, \ldots, p)
\end{equation}
is sometimes used. Measurement units on your feature variables influence the relative distance values which, in turn, will affect your clusters. Features with high variance will have the largest impact. If all features are deemed equally important to your grouping, then the data need to be standardized. You can do this with the \verb@scale@ function, whose default method centers and scales the columns of your data matrix.

The \verb@dist@ function computes the distance between objects using a variety of methods including Euclidean (default), Manhattan, and maximum. Create two feature vectors each having five values and plot them with object labels in feature space.
<>=
x1 = c(2, 1, -3, -2, -3)
x2 = c(1, 2, -1, -2, -2)
plot(x1, x2, xlab="Feature 1", ylab="Feature 2")
text(x1, x2, labels=1:5, pos=c(1, rep(4, 4)))
@
From the plotted points you can easily group the five objects into two clusters by eye. There is an obvious distance separation between the clusters. To actually compute pairwise Euclidean distances between the five objects, type
<>=
d = dist(cbind(x1, x2))
d
@
The result is a vector of distances, but printed as a table with the object numbers listed as row and column names. The values are the pairwise distances so, for instance, the distance between object 1 and object 2 is 1.41 units. You can see two distinct clusters of distances (dissimilarities): those less than or equal to 1.41 and those greater than or equal to 5. Objects 1 and 5 are the most dissimilar, followed by objects 2 and 5. On the other hand, object pairs 3 and 5, and 4 and 5, are the most similar. You can change the default Euclidean distance to the Manhattan distance by including \verb@method="man"@ in the \verb@dist@ function.

If the feature vectors contain values that are not continuous numeric (e.g., factors, ordered, binary) or if there is a mixture of data types (one feature is numeric the other is a factor, for instance), then dissimilarities need to be computed differently. The {\bf cluster} package contains functions for cluster analysis including \verb@daisy@ for dissimilarity matrix calculations, which by default uses Euclidean distance as the dissimilarity metric. To test, type
<>=
require(cluster)
d = daisy(cbind(x1, x2))
d
@
Note the values that make up the dissimilarity matrix are the same as the distance values above, but additional information is saved including the metric used and the total number of objects in your data set. The function contains a logical flag called \verb@stand@ that when set to true standardizes the feature vectors before calculating dissimilarities. If some of the features are not numeric, the function computes a generalized coefficient of dissimilarity \citep{Gower1971}.
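A small sketch with hypothetical toy data shows this fallback; adding a factor column makes \verb@daisy@ switch to Gower's coefficient.
<>=
mix = data.frame(x1 = x1,
                 f = factor(c("a", "b", "a", "b", "a")))
daisy(mix)  # mixed types, so Gower's coefficient is used
@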
\subsection{K-means clustering}

Partition clustering requires you to specify the number of clusters beforehand. This number is denoted $k$. The algorithm allocates each object in your data frame to one and only one of the $k$ clusters. The $k$-means method is the most popular. Membership of an object is determined by its distance from the centroid of each cluster. The centroid is the multidimensional version of the mean. The method alternates between calculating the centroids based on the current cluster members, and reassigning objects to clusters based on the new centroids. For example, in deciding which of two clusters an object belongs to, the method computes the dissimilarity between the object and the centroid of cluster one and between the object and the centroid of cluster two. It then assigns the object to the cluster with the smallest dissimilarity and recomputes the centroid with this new object included. An object already in this cluster might now have greater dissimilarity due to the repositioning of the centroid, in which case it gets reassigned. Assignments and reassignments continue in this way until all objects have a membership.

The \verb@kmeans@ function from the base {\bf stats} package performs $k$-means clustering. By default it uses the algorithm of \cite{HartiganWong1979}. The first argument is the data frame (not the dissimilarity matrix) and the second is the number of clusters (centers). To perform a $k$-means cluster analysis on the example data above, type
<>=
dat = cbind(x1, x2)
ca = kmeans(dat, centers=2)
ca
@
Initial centroids are chosen at random so it is recommended to rerun the algorithm a few times to see if it arrives at the same groupings. While the algorithm minimizes within-cluster variance, it does not ensure that there is a global minimum variance across all clusters.

The first bit of output gives the number of clusters (input) and the sizes of the clusters that result. Here cluster 1 has three members and cluster 2 has two. The next bit of output gives the cluster centroids. There are two features labeled \verb@x1@ and \verb@x2@. The rows list the centroids for clusters 1 and 2 as vectors in this two-dimensional feature space. The centroid of the first cluster is ($-$2.67, $-$1.67) and the centroid of the second cluster is (1.5, 1.5). The centroid is the feature average using all objects in the cluster. You can see this by adding the centroids to your plot.
<>=
points(ca$centers, pch=8, cex=2)
text(ca$centers, pos=c(4, 1),
  labels=c("Cluster 2", "Cluster 1"))
@
The next bit of output tags each object with cluster membership. Here you see the first two objects belong to cluster 2 and the next three belong to cluster 1. The cluster number keeps track of distinct clusters but the numerical order is irrelevant. The within-cluster sum of squares is the sum of the squared distances between each object and the centroid of the cluster to which it belongs. From the plot you can see that the centroids (stars) minimize the within-cluster distances while maximizing the between-cluster distances.

The function \verb@pam@ is similar but uses medoids rather than centroids. A medoid is a multidimensional center based on medians. The method accepts a dissimilarity matrix, tends to be more robust (it converges to the same result each run), and provides additional graphical displays from the {\bf cluster} package.
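As a quick sketch using the toy dissimilarity object \verb@d@ from above, you can partition around medoids and summarize cluster quality with silhouette widths.
<>=
pm = pam(d, k=2)         # partitioning around medoids
pm$id.med                # index of the medoid object in each cluster
summary(silhouette(pm))  # silhouette widths gauge cluster quality
@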
\subsection{Track clusters}

Your interest is to group hurricanes according to common track features. These features include location of hurricane origin, location of lifetime maximum intensity, location of final hurricane intensity, and lifetime maximum intensity. The three location features are further subdivided into latitude and longitude making a total of seven attribute (feature) variables. Note that location is treated here as an attribute rather than as a spatial coordinate.

You first create a data frame containing only the attributes you wish to cluster. Here you use the cyclones since 1950.
<>=
load("best.use.RData")
best = subset(best.use, Yr >= 1950)
@
You then split the data frame into separate cyclone tracks using \verb@Sid@ with each element in the list as a separate track. You are interested in identifying features associated with each track.
<>=
best.split = split(best, best$Sid)
@
You then assemble a data frame using only the attributes to be clustered.
<>=
best.c = t(sapply(best.split, function(x){
  x1 = unlist(x[1, c("lon", "lat")])
  x2 = unlist(x[nrow(x), c("lon", "lat")])
  x3 = max(x$WmaxS)
  x4 = unlist(x[rev(which(x3 == x$WmaxS))[1],
    c("lon", "lat")])
  return(c(x1, x2, x3, x4))
}))
best.c = data.frame(best.c)
colnames(best.c) = c("FirstLon", "FirstLat", "LastLon",
  "LastLat", "WmaxS", "MaxLon", "MaxLat")
@
The data frame contains \Sexpr{dim(best.c)[2]} features from \Sexpr{dim(best.c)[1]} cyclones. Before clustering you check the feature variances by applying the \verb@var@ function on the columns of the data frame using the \verb@sapply@ function.
<>=
sapply(best.c, var)
@
The variances range from a minimum of \Sexpr{round(sapply(best.c,var)[2],1)} degrees squared for the latitude of origin feature to a maximum of \Sexpr{round(sapply(best.c,var)[5],1)} knots squared for the maximum intensity feature. Because of the rather large range in variances you scale the features before performing cluster analysis. This is important if you intend each feature to have the same influence on the clustering.
<>=
best.cs = scale(best.c)
m = attr(best.cs, "scaled:center")
s = attr(best.cs, "scaled:scale")
@
By default the function \verb@scale@ centers and scales the columns of your numeric data frame. The center and scale values are saved as attributes in the new data frame. Here you save them to rescale the centroids after the cluster analysis. You perform a $k$-means cluster analysis setting the number of clusters to three by typing
<>=
k = 3
ct = kmeans(best.cs, centers=k)
summary(ct)
@
The output is a list of length seven containing the cluster vector, cluster means (centroids), the total sum of squares, the within-cluster sum of squares by cluster, the total within sum of squares, the between sum of squares, and the size of each cluster. Your cluster means are in scaled units so they are not readily interpretable. The cluster vector gives the membership of each hurricane in order as they appear in your data set.

The ratio of the between sum of squares to the total sum of squares is \Sexpr{round(ct$betweenss/ct$totss*100,1)}\%. This ratio will increase with the number of clusters, but at the expense of having clusters that are not physically interpretable. With four clusters, the increase in this ratio is smaller than the increase going from two to three clusters. So you are content with your three-cluster solution.

\subsection{Track plots}

Since six of the seven features are spatial coordinates it's tempting to plot the centroids on a map and connect them with a track line. This would be a mistake, as the centroids are cluster summaries in feature space, not geographic representations of actual tracks. Instead, you plot examples of cyclones that resemble each cluster.
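Although the centroids should not be mapped as tracks, you can express them in physical units by undoing the scaling with the center and scale values saved above. A minimal sketch:
<>=
# Multiply by the saved scale s, then add back the center m.
ctrs.phys = sweep(sweep(ct$centers, 2, s, "*"), 2, m, "+")
round(ctrs.phys, 1)
@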
First, you add cluster membership and distance to your original data frame and then split the data by cluster membership.
<>=
ctrs = ct$center[ct$cluster, ]
cln = ct$cluster
dist = rowMeans((best.cs - ctrs)^2)
id = 1:length(dist)
best.c = data.frame(best.c, id=id, dist=dist, cln=cln)
best.c.split = split(best.c, best.c$cln)
@
Next you subset your cluster data based on the tracks that come closest to the cluster centroids. This closeness is measured in feature space, which includes latitude and longitude but also intensity. Here you choose 9 tracks for each centroid.
<>=
te = 9
bestid = unlist(lapply(best.c.split, function(x)
  x$id[order(x$dist)[1:te]]))
cinfo = subset(best.c, id %in% bestid)
@
Finally you plot the tracks on a map. This requires a few steps to make the plot easier to read. Begin by setting the bounding box using the latitude and longitude of your cluster data and renaming your objects.
<>=
cyclones = best.split
uselat = range(unlist(lapply(cyclones[cinfo$id],
  function(x) x$lat)))
uselon = range(unlist(lapply(cyclones[cinfo$id],
  function(x) x$lon)))
@
Next, order the clusters by number and distance and set the colors for plotting. Use the \verb@brewer.pal@ function in the {\bf RColorBrewer} package \citep{Neuwirth2011}. You use a single-hue sequential color ramp that allows you to highlight the cyclone tracks that are closest to the feature clusters.
<>=
cinfo = cinfo[order(cinfo$cln, cinfo$dist), ]
require(RColorBrewer)
blues = brewer.pal(te, "Blues")
greens = brewer.pal(te, "Greens")
reds = brewer.pal(te, "Reds")
cinfo$colors = c(rev(blues), rev(greens), rev(reds))
@
Next, reorder the tracks for plotting them as a weave with the tracks farthest from the centroids plotted first.
<>=
cinfo = cinfo[order(-cinfo$dist, cinfo$cln), ]
@
Finally, plot the tracks and add world and country borders. Results are shown in Fig.~\ref{chap11:fig:tracksbycluster}. Track color is based on attribute proximity to the cluster centroid using a color saturation that decreases with distance.
<>=
plot(uselon, uselat, type="n", xaxt="n", yaxt="n",
  ylab="", xlab="")
for(i in 1:nrow(cinfo)){
  cid = cinfo$id[i]
  cyclone = cyclones[[cid]]
  lines(cyclone$lon, cyclone$lat, lwd=2,
    col=cinfo$colors[i])
}
require(maps)
map("world", add=TRUE)
map("usa", add=TRUE)
@
\begin{figure}
\centering
<>=
require(maps)
plot(uselon, uselat, type="n", xaxt="n", yaxt="n",
  ylab="", xlab="")
map("world", add=TRUE)
map("usa", add=TRUE)
for(i in 1:nrow(cinfo)){
  cid = cinfo$id[i]
  cyclone = cyclones[[cid]]
  n = length(cyclone$lon)
  lines(cyclone$lon, cyclone$lat, lwd=2,
    col=cinfo$colors[i])
  arrows(cyclone$lon[n - 1], cyclone$lat[n - 1],
    cyclone$lon[n], cyclone$lat[n], lwd=2, length=.1,
    col=cinfo$colors[i])
}
@
\vspace{-1.7cm}
\caption{Tracks by cluster membership.}
\label{chap11:fig:tracksbycluster}
\end{figure}
The analysis splits the cyclones into a group over the Gulf of Mexico, a group at high latitudes, and a group that begins at low latitudes but ends at high latitudes generally east of the United States. Some of the cyclones that begin at high latitude are baroclinic (see Chapter~\ref{chap:frequencymodels}). Cluster membership can change depending on the initial random centroids, particularly for tracks that are farthest from a centroid. The two- and three-cluster solutions are the most stable.

The approach can be extended to include other track features including velocity, acceleration, and curvature representing more of the space and time behavior of hurricanes. Model-based clustering, where the observations are assumed to arise from a finite mixture of probability distributions, is an attractive alternative if you want to make inferences about future hurricane activity.
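As a sketch of this alternative, you could fit a Gaussian mixture model to the scaled track features with the {\bf mclust} package (an assumption here; the package is not used elsewhere in this book), which selects the number of mixture components by BIC.
<>=
require(mclust)
mb = Mclust(best.cs)  # Gaussian mixtures; components chosen by BIC
summary(mb)
@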
In the end it's worthwhile to keep in mind the advice of Bill Venables and Brian Ripley: you should not assume cluster analysis is the best way to discover interesting groups. Indeed, visualization methods are often more effective in this task.

This chapter showed you how to detect, model, and analyze clusters in hurricane data. We began by showing you how to detect and model the clustered arrival of hurricanes in Florida. We then showed you how to detect and analyze clusters in spatial events arising from hurricane genesis locations. We looked at the first- and second-order statistical properties of spatial point data. Lastly we showed you how to apply cluster analysis to hurricane track features and map representative tracks.