date()
## [1] "Mon Dec 3 20:28:40 2012"
# setwd('~/Dropbox/Book/Chapter11')
Our interest is to group hurricanes according to common track features. These features include location of hurricane origin, location of lifetime maximum intensity, location of final hurricane intensity, and lifetime maximum intensity. The three location features are further subdivided into latitude and longitude making a total of seven attribute (feature) variables. Location is treated here as an attribute rather than as a spatial coordinate.
First we create a data frame containing only the attributes we wish to use to in the cluster analysis. For illustration we use the cycones only since 1950.
load("best.use.RData")
best = subset(best.use, Yr >= 1950)
Next we split the data frame into separate cyclone tracks using the unique storm identification (Sid) with each element in the list as a separate track. We want to identify features associated with each track.
best.split = split(best, best$Sid)
Next we assemble a data frame using only the attributes to be clustered.
best.c = t(sapply(best.split, function(x) {
x1 = unlist(x[1, c("lon", "lat")])
x2 = unlist(x[nrow(x), c("lon", "lat")])
x3 = max(x$WmaxS)
x4 = unlist(x[rev(which(x3 == x$WmaxS))[1], c("lon", "lat")])
return(c(x1, x2, x3, x4))
}))
best.c = data.frame(best.c)
colnames(best.c) = c("FirstLon", "FirstLat", "LastLon", "LastLat", "WmaxS",
"MaxLon", "MaxLat")
The data frame contains 7 features from 667 cyclones.
Before clustering we check the feature variances by applying the var() function on the columns of the data frame using the sapply() function.
sapply(best.c, var)
## FirstLon FirstLat LastLon LastLat WmaxS MaxLon MaxLat
## 499.21 57.04 710.41 164.82 870.69 356.72 75.90
Variance ranges from a minimum of 57 degrees squared for the latitude of origin feature to a maximum of 870.7 knots squared for the maximum intensity feature. Because of the rather large range in variance we scale the features before performing cluster analysis. This is important since we want each feature to have the same influence on the grouping.
best.cs = scale(best.c)
m = attr(best.cs, "scaled:center")
s = attr(best.cs, "scaled:scale")
By default the function scale() centers and scales the columns of the numeric data frame. Center and scale values are saved as attributes in the data frame.
The \( k \)-means method is a popular cluster partitioning. Track membership is determined by distance from the cluster centroids. The centroid is the multidimensional version of the mean. The method alternates between calculating the centroids based on the current cluster members, and reassigning tracks to clusters based on the new centroids.
We perform a \( k \)-means cluster analysis setting the number of clusters to three by typing
k = 3
ct = kmeans(best.cs, centers = k)
summary(ct)
## Length Class Mode
## cluster 667 -none- numeric
## centers 21 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
The output is a list of length seven containing the cluster vector, cluster means (centroids), the total sum of squares, the within cluster sum of squares by cluster, the total within sum of squares, the between sum of squares, and the size of each cluster.
Cluster means are scaled so they are not readily meaningful. The cluster vector gives the membership of each hurricane in order as they appear in the original data. The ratio of the between sum of squares to the total sum of squares is 44.7%. This ratio increases with the number of clusters, but at the expense of having clusters that are not physically interpretable. With four clusters, the increase in this ratio is smaller than the increase going from two to three clusters. So we are content with a three-cluster solution.
Since six of the seven features are spatial coordinates it's tempting to plot the centroids on a map and connect them with a path. This is a mistake as there is not a geographic representation of the clusters. Instead, we plot examples of hurricanes that resemble each cluster.
First, we add cluster membership and distance to the original data frame and then split the data by cluster member.
ctrs = ct$center[ct$cluster, ]
cln = ct$cluster
dist = rowMeans((best.cs - ctrs)^2)
id = 1:length(dist)
best.c = data.frame(best.c, id = id, dist = dist, cln = cln)
best.c.split = split(best.c, best.c$cln)
Next we subset the data based on the tracks that come closest to the cluster centroids. Closeness occurs in feature space that includes latitude and longitude but also intensity. Here we choose nine tracks for each centroid.
te = 9
bestid = unlist(lapply(best.c.split, function(x) x$id[order(x$dist)[1:te]]))
cinfo = subset(best.c, id %in% bestid)
Finally we plot the tracks on a map. This requires a few steps to make the plot easier to read and interpret. We begin by setting the bounding box using latitude and longitude of our data and renaming the objects.
cyclones = best.split
uselat = range(unlist(lapply(cyclones[cinfo$id], function(x) x$lat)))
uselon = range(unlist(lapply(cyclones[cinfo$id], function(x) x$lon)))
Next we order the clusters by number and distance and set the colors for plotting. Use the brewer.pal() function in the RColorBrewer package. We use a single hue sequential color ramp that allows us to highlight the cyclone tracks that are closest to the cluster centers.
cinfo = cinfo[order(cinfo$cln, cinfo$dist), ]
suppressMessages(require(RColorBrewer))
blues = brewer.pal(te, "Blues")
greens = brewer.pal(te, "Greens")
reds = brewer.pal(te, "Reds")
cinfo$colors = c(rev(blues), rev(greens), rev(reds))
Next we reorder the tracks so that we can plot them as a weave with the tracks farthest from the centroids plotted first.
cinfo = cinfo[order(-cinfo$dist, cinfo$cln), ]
Finally we plot the tracks and add world and country borders. Track color is based on attribute proximity to the cluster centroid using a color saturation that decreases with distance.
plot(uselon, uselat, type = "n", xaxt = "n", yaxt = "n", ylab = "", xlab = "")
for (i in 1:nrow(cinfo)) {
cid = cinfo$id[i]
cyclone = cyclones[[cid]]
lines(cyclone$lon, cyclone$lat, lwd = 2, col = cinfo$colors[i])
}
suppressMessages(require(maps))
map("world", add = TRUE)
map("usa", add = TRUE)
Tracks by cluster membership
The analysis splits the cyclones into a group over the Gulf of Mexico, a group at high latitudes, and a group that begins at low latitudes but ends at high latitudes generally east of the United States. Some of the cyclones that begin at high latitude are baroclinic.
Cluster membership can change depending on the initial random centroids particularly for tracks that are farthest from a centroid. The two- and three-cluster tracks are the most stable. The approach can be extended to include other track features including velocity, acceleration, and curvature representing more of the space and time behavior of hurricanes.
Model-based clustering, where the observations are assumed to be quantified by a finite mixture of probability distributions, is an attractive alternative if you want to make inferences about future hurricane activity. In the end it's worthwhile to keep in mind the advice of Bill Venables and Brian Ripley. You should not assume cluster analysis is the best way to discover interesting groups. Indeed, visualization methods are often more effective in this task.