# Chapter 9: Spatial Models

“Design is a question of substance, not just form.”—Adriano Olivetti

packages: gstat, maptools, sp, rgdal, maps, colorRamps, spdep, spgwr, plyr
source code: NONE
third party: NONE

Here we show you how to use cyclone track data together with climate field data to create spatial models.

Spatial models make use of location information in data. Geographic coordinates locate the cyclone's center on the surface of the earth and wind speed provides an attribute. Spatial models make use of cyclone location separate from cyclone attributes.

## Hexagon Grids

Here we show you how to create a spatial framework for combining cyclone data with spatially continuous climate data. The method tessellates the basin with hexagons and populates them with local cyclone and climate information (Elsner et al. 2012).

### Spatial points data frame

Here you are interested in wind speeds along the entire track for all tropical storms and hurricanes during the 2005 North Atlantic season. You begin by creating a data frame from the best.use.RData file where you subset on year and wind speed and convert the speed to meters per second.

load("best.use.RData")
W.df = subset(best.use, Yr == 2005 & WmaxS >= 34 & Type == "*")
W.df$WmaxS = W.df$WmaxS * 0.5144


The asterisk on for Type indicates tropical cyclone as opposed to a tropical wave or extratropical cyclone. The number of rows in your data frame is the total number of cyclone hours (3010) and you save this by typing

ch = nrow(W.df)


Next, assign the lon and lat columns of the data frame as spatial coordinates using the coordinates() function (sp package). Make a copy of your data frame keeping the unique storm identifier, the spatial coordinates and the wind speed columns.

require(sp)
W.sdf = W.df[c("Sid", "lon", "lat", "WmaxS")]
coordinates(W.sdf) = c("lon", "lat")
str(W.sdf, max.level = 2, strict.width = "cut")

## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  3010 obs. of  2 variables:
##   ..@ coords.nrs : int [1:2] 2 3
##   ..@ coords     : num [1:3010, 1:2] -83.9 -83.9 -83.9 -84 -84 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ bbox       : num [1:2, 1:2] -100 11 -12.4 44.2
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots


The result is a spatial points data frame with five slots. The data slot is a data frame and contains the attributes (here only wind speed). The coordinate (coord) slot contains the longitude and latitude columns from the original data frame and the coordinate numbers (here two spatial dimensions) are given in the coords.nrs slot.

These two columns are no longer in the data slot. The bounding box (bbox) slot is a matrix containing the maximal extent of the hourly positions as defined by the lower left and upper right longitude/latitude coordinates.

A summary of the information in your spatial points data frame is obtained by typing

summary(W.sdf)

## Object of class SpatialPointsDataFrame
## Coordinates:
##        min    max
## lon -99.99 -12.39
## lat  10.95  44.23
## Is projected: NA
## proj4string : [NA]
## Number of points: 3010
## Data attributes:
##       Sid           WmaxS
##  Min.   :1346   Min.   :17.5
##  1st Qu.:1353   1st Qu.:23.7
##  Median :1360   Median :29.5
##  Mean   :1360   Mean   :33.3
##  3rd Qu.:1367   3rd Qu.:38.0
##  Max.   :1373   Max.   :80.2


There are 3010 hourly cyclone hours. The projection (proj4string) slot contains an NA character indicating that it has not yet been specified.

You give your spatial data frame a coordinate reference system using the CRS() function. Here you specify a geographic reference (intersections of parallels and meridians) as a character string in an object called ll_crs, then use the proj4string() function to generate a CRS object and assign it to your spatial data frame.

ll = "+proj=longlat +datum=WGS84"
proj4string(W.sdf) = CRS(ll)
slot(W.sdf, "proj4string")

## CRS arguments: +proj=longlat +datum=WGS84


Next you transform the geographic CRS into a Lambert conformal conic (LCC) planar projection using the parallels 30 and 60$$^\circ$$ N and a center longitude of 60$$^\circ$$ W. First save the reference system as a CRS object then use the spTransform() function from the rgdal package.

lcc = "+proj=lcc +lat_1=60 +lat_2=30 +lon_0=-60"
require(rgdal)

## Loading required package: rgdal

## rgdal: version: 0.8-5, (SVN revision 449) Geospatial Data Abstraction
## 1.9.0, released 2011/12/29 Path to GDAL shared files:
## /Library/Frameworks/R.framework/Versions/2.15/Resources/library/rgdal/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480] Path to
## PROJ.4 shared files:
## /Library/Frameworks/R.framework/Versions/2.15/Resources/library/rgdal/proj

W.sdf = spTransform(W.sdf, CRS(lcc))
bbox(W.sdf)

##          min     max
## lon -3999983 3987430
## lat  1493807 5521312


The coordinates are now planar rather than geographical, but the coordinate names remain the same (lon and lat). Transformation to a plane makes it easy to perform geometric calculations.

### Hexagon tessellation

Next you construct a hexagon tessellation of the cyclone tracks. This is done by first situating the points representing the center of each hexagon and then drawing the hexagon boundaries.

You create a set of equally-spaced points in a hexagon grid configuration within the bounding box defined by your spatial points data frame using the spsample() function. Here you specify the number of hexagons, but the actual number will vary depending on the bounding box.

You expand the bounding box by multiplying each coordinate value by 15% and use an offset vector to fix the position of the hexagons over the tracks. The expansion and offset values require a bit of trail-and-error.

set.seed(3042)
bb = bbox(W.sdf)
bb[1, 1:2] = bb[1, 1:2] * 1.15
bb[2, 2] = bb[2, 2] * 1.15
bb[2, 1] = bb[2, 1] * 0.8
hpt = spsample(W.sdf, type = "hexagonal", n = 250, bb = bb, offset = c(0.5,
0.5))


The methods used in spatial sampling function assume that the geometry has planar coordinates, so your spatial data object should not have geographic coordinates.

Next you call the function to convert the points to hexagons. More points result in smaller hexagons.

hpg = HexPoints2SpatialPolygons(hpt)
plot(hpg)
plot(W.sdf, pch = 20, cex = 0.3, add = TRUE)


The number of hexagons generated and the area of each is obtained by typing

np = length(hpg@polygons)
area = hpg@polygons[[1]]@area
np

## [1] 193

area

## [1] 2.187e+11


This results in 193 equal-area hexagons. The length unit of the LCC projection is meters. To convert the area from square meters to square kilometers, multiply by 10$$^{-6}$$. Thus the area of each hexagon is approximately 2.1868 × 105 km$$^2$$.

### Overlays

With your hexagons and cyclone locations having the same projection, next you obtain the maximum intensity and number of observations per grid. The function over() combines points (or grids) and polygons by performing point-in-polygon operations on all point-polygon combinations.

You obtain a vector containing the hexagon identification number for each hourly cyclone observation and then add this vector as a column to W.sdf.

hexid = over(x = W.sdf, y = hpg)
W.sdf$hexid = hexid  The length of the vector is the number of hourly observations. Next, compute per hexagon statistics including maximum intensity and number of cyclone hours. require(plyr) perHexStats = ddply(W.sdf@data, .(hexid), summarize, count = length(Sid), hct = length(unique(Sid)), WmaxS = max(WmaxS))  Finally, create a spatial polygons data frame from the hexagons with per hexagon statistics. ID = paste("ID", perHexStats$hexid, sep = "")
row.names(perHexStats) = ID
hspdf = SpatialPolygonsDataFrame(hpg[ID], perHexStats, match.ID = TRUE)

##      hexid count hct WmaxS
## ID4      4     9   1 23.32
## ID8      8    18   2 56.42
## ID9      9    17   1 43.49
## ID10    10    14   1 29.76
## ID11    11    14   1 23.35
## ID12    12    14   1 23.01


Here you list the first six rows of the data frame that correspond to the first six hexagon grids.

### Choropleth maps

You now have a spatial polygons data frame with each polygon as an equal-area hexagon on a LCC projection with attributes (frequency and intensity) in the data slot. Choropleth maps of cyclone attributes are created using the spplot() method.

Before mapping you create a couple of lists that are used by the sp.layout argument. One specifies the set of hourly locations from the spatial points data frame and assigns values to plot arguments.

l1 = list("sp.points", W.sdf, pch = 20, col = "gray", cex = 0.3)


Another specifies the coastlines as a spatial lines object. First acquire the maps and maptools packages. The first package contains the map() function to generate an object of country borders and the second contains the map2SpatialLines() conversion function.

The conversion is made to geographic coordinates, which are then transformed to the LCC projection of your above spatial objects. You set geographic coordinate limits on the map domain to limit the amount of conversion and projection calculations.

require(maps)
require(maptools)
cl = map("world", xlim = c(-120, 20), ylim = c(-10, 70), plot = FALSE)
clp = map2SpatialLines(cl, proj4string = CRS(ll))
clp = spTransform(clp, CRS(lcc))
l2 = list("sp.lines", clp, col = "gray")


Depicting attribute levels on a map is done using a color ramp. A color ramp creates a character vector of color hex codes. The number of colors is specified by the argument in the color ramp function.

Numerous color ramp functions are available in the colorRamps() package (Keitt 2009). Acquire the package and assign 20 colors to the vector cr using the blue to yellow color ramp.

require(colorRamps)

## Loading required package: colorRamps

cr = blue2yellow(20)


If the number of colors is less than the number of levels the colors get recycled. A map of the cyclone frequency is made by typing

require(grid)
p1 = spplot(hspdf, "count", col = "white", col.regions = blue2yellow(20), sp.layout = list(l1,
l2), at = seq(0, 140, 20), colorkey = list(space = "bottom", labels = list(labels = paste(seq(0,
140, 20)), cex = 0.8)), sub = list("Cyclone hours", cex = 0.8, font = 1))
p2 = spplot(hspdf, "WmaxS", col = "white", col.regions = blue2red(20), sp.layout = list(l2),
at = seq(10, 90, 10), colorkey = list(space = "bottom", labels = list(labels = paste(seq(10,
90, 10)), cex = 0.8)), sub = list("Highest intensity [m/s]", cex = 0.8,
font = 1))
p1 = update(p1, main = textGrob("a", x = unit(0.05, "npc")))
p2 = update(p2, main = textGrob("b", x = unit(0.05, "npc")))
print(p1, split = c(1, 1, 1, 2), more = TRUE)
print(p2, split = c(1, 2, 1, 2), more = FALSE)


Areas along the southeastern coastline of the United States have the greatest number of cyclone hours while the region from the western Caribbean into the Gulf of Mexico has the highest intensities.

## SST Data

The effort needed to create track grids pays off nicely when you add regional climate data. Here you use sea-surface temperature (SST) from July 2005 as an example. July values indicate conditions occurring before the active part of the North Atlantic hurricane season.

Earlier you used functions in the ncdf package along with some additional code to extract a data frame of SST consisting of monthly values at the intersections of parallels and meridians at 2$$^\circ$$ intervals. We return to these data here. Input them by typing

con = "http://www.hurricaneclimate.com/storage/chapter-9/JulySST2005.txt"

##     SST  lon lat
## 1 24.13 -100   0
## 2 23.96  -98   0
## 3 23.76  -96   0
## 4 23.53  -94   0
## 5 23.44  -92   0
## 6 23.60  -90   0


The data are listed in the column labeled SST. You treat these values as point data because they have longitudes and latitudes although they are regional averages.

At 25$$^\circ$$ N latitude the 2$$^\circ$$ spacing of the SST values covers an area of approximately the same size as the hexagon grids used in the previous section. The SST locations are converted to planar coordinates using the same LCC projection after the data frame is converted to a spatial data frame.

You need to first assign a projection string to the proj4string slot.

coordinates(sst) = c("lon", "lat")
proj4string(sst) = CRS(ll)
sst = spTransform(sst, CRS(lcc))


To examine the SST data as spatial points, type

spplot(sst, "SST", col.regions = rev(heat.colors(20)))


This produces a plot of the spatial points data frame, where the SST attribute is specified as a character string in the second argument.

Your interest is the average SST within each hexagon. Again, use the over() function to combine your SST values with the track polygons setting the argument fn to mean.

ssta = over(x = hspdf, y = sst, fn = mean)


The result is a data frame with a single column representing the average over all SST values in each hexagon. The row names are the hexagon numbers prefixed with ID, however, hexagons completely over land have missing SST values.

Next, you add the average SSTs as an attribute to spatial polygon data frame and then remove hexagons with missing SST values.

hspdf$sst = ssta$SST
hspdf = hspdf[!is.na(hspdf$sst), ] str(slot(hspdf, "data"))  ## 'data.frame': 86 obs. of 5 variables: ##$ hexid: int  4 8 9 10 11 12 13 21 22 23 ...
##  $count: int 9 18 17 14 14 14 1 25 55 11 ... ##$ hct  : int  1 2 1 1 1 1 1 1 1 1 ...
##  $WmaxS: num 23.3 56.4 43.5 29.8 23.4 ... ##$ sst  : num  28.8 28.9 29.1 29.1 28.8 ...


Finally you create the July SST map corresponding to where the cyclones occurred during the season by typing

spplot(hspdf, "sst", col = "white", col.regions = blue2red(20), sp.layout = list(l1,
l2), at = seq(14, 30, 2), colorkey = list(space = "bottom", labels = list(cex = 0.8)),
sub = list("Sea surface temperature [C]", cex = 0.8, font = 1))


Ocean temperatures exceed 26$$^\circ$$ C over a wide swath of the North Atlantic extending into the Caribbean Sea and Gulf of Mexico. Coldest waters are noted off the coast of Nova Scotia and Newfoundland.

## SST and Intensity

Analyzing and modeling your SST and cyclone data is easy with your spatial polygons data frame. For instance, the average maximum wind speed in regions where the SST exceeds 28$$^\circ$$ C is obtained by typing

mean(hspdf$WmaxS[hspdf$sst > 28])

## [1] 47.05


Here you treat your spatial data frame as you would a regular data frame. Continuing you create side-by-side box plots of wind speed conditional on the collocated values of SST by typing

boxplot(hspdf$WmaxS ~ hspdf$sst > 28)


Importantly, the spatial information allows you to analyze relationships on a map. Here you map the hexagons colored by groups defined by a two-way table of cyclone intensity and SST using above and below median values.

mT = median(hspdf$sst) mI = median(hspdf$WmaxS)
hspdf$group = ifelse(hspdf$sst >= mT & hspdf$WmaxS >= mI, 4, ifelse(hspdf$sst >=
mT & hspdf$WmaxS < mI, 3, ifelse(hspdf$sst <= mT & hspdf$WmaxS >= mI, 2, ifelse(hspdf$sst <= mT & hspdf$WmaxS < mI, 1, 0)))) mycolors = c("blue", "cyan", "magenta", "red") txt = list(c("Low SST, Low intensity", "Low SST, High intensity", "High SST, Low intensity", "High SST, High intensity")) print(spplot(hspdf, "group", col = "white", col.regions = mycolors, sp.layout = list(l1, l2), colorkey = F, key = list(space = "bottom", rect = list(col = mycolors), text = txt, cex = 0.7)))  Red hexagons indicate regions of high intensity and high ocean temperature and blue hexagons indicate regions of low intensity and low ocean temperature. More interesting are regions of mismatch. Magenta hexagons show low intensity coupled with relatively high ocean temperature indicating regions with 'under-performing' cyclones (cyclones weaker than the thermodynamic potential of their environment). Cyan hexagons show high intensity coupled with relatively low temperature indicating regions with 'over-performing' cyclones (cyclones stronger than the thermodynamic potential of their environment). Maps like these provide insight into the relationship between hurricanes and climate not accessible with basin-level analyzes. Next we show how you how to model these spatial data. We begin with a look at spatial correlation and then use a spatial regression model to quantify the regional change in intensity as a function of SST. ## Spatial Autocorrelation Values in neighboring hexagons will tend to be more similar than values in hexagons farther away. Spatial autocorrelation quantifies the degree of similarity in values across geographic space. ### Moran's I A measure of spatial autocorrelation is Moran's $$I$$ (Moran 1950) defined as $I = \frac{m}{s} \frac{y^TWy}{y^Ty}$ where $$m$$ is the number of hexagons, $$y$$ is the vector of values within each hexagon (e.g., cyclone intensities) where the values are deviations from the overall mean, $$W$$ is a weights matrix, $$s$$ is the sum over all the weights, and the subscript $$T$$ indicates the transpose operator. Values of Moran's $$I$$ range from$-$1 to$+$1 with a value of zero indicating a pattern with no spatial autocorrelation. Although not widely used in climate studies, deBeurs and Henebry (2008) use it to identify spatially coherent eco-regions and biomes related to the North Atlantic oscillation. To compute Moran's $$I$$ you need a weights matrix. The weights matrix is square with the number of rows and columns equal to the number of hexagons. Here the weight in row $$i$$, column $$j$$ of the matrix is assigned a non-zero value if hexagon $$i$$ is contiguous with hexagon $$j$$, otherwise it is assigned a zero. The spdep package (Bivand 2011) has functions for creating weights based on contiguity (and distance) neighbors. The process requires two steps. First, you use the poly2nb() function on your spatial polygons data frame to create a contiguity-based neighborhood list object. require(spdep) hexnb = poly2nb(hspdf) hexnb  ## Neighbour list object: ## Number of regions: 86 ## Number of nonzero links: 370 ## Percentage nonzero weights: 5.003 ## Average number of links: 4.302  The list is ordered by hexagon number starting with the southwest-most hexagon. It has 2 neighbors; hexagon numbers 8 and 9. Hexagon numbers increase to the east and north. A hexagon has at most six contiguous neighbors. Hexagons at the borders have fewer neighbors. A graph of the hexagon connectivity, here defined by the first-order contiguity, is made by typing plot(hexnb, coordinates(hspdf)) plot(hspdf, add = TRUE)  A summary method applied to the neighborhood list reveals the average number of neighbors and the distribution of connectivity among the hexagons. You turn the neighborhood list object into a listw object using the nb2listw() function that duplicates the neighborhood list and adds the weights. The style argument determines the weighting scheme. With the argument value set to W the weights are the inverse of the number of neighbors. For instance, the six neighbors of a fully connected hexagon each get a weight of 1/6. wts = nb2listw(hexnb, style = "W") summary(wts)  ## Characteristics of weights list object: ## Neighbour list object: ## Number of regions: 86 ## Number of nonzero links: 370 ## Percentage nonzero weights: 5.003 ## Average number of links: 4.302 ## Link number distribution: ## ## 1 2 3 4 5 6 ## 2 7 20 15 18 24 ## 2 least connected regions: ## ID13 ID170 with 1 link ## 24 most connected regions: ## ID39 ID40 ID41 ID57 ID58 ID74 ID75 ID80 ID90 ID91 ID92 ID93 ID96 ID97 ID98 ID101 ID112 ID113 ID114 ID115 ID116 ID131 ID132 ID133 with 6 links ## ## Weights style: W ## Weights constants summary: ## n nn S0 S1 S2 ## W 86 7396 86 45.04 349.4  Now you are ready to compute the value of Moran's $$I$$. This is done using the moran() function. The first argument is the variable of interest followed by the name of the listw object. Also needed is the number of hexagons and the global sum of the weights, which is obtained using the Szero() function. m = length(hspdf$WmaxS)
s = Szero(wts)
moran(hspdf$WmaxS, wts, n = m, S0 = s)  ##$I
## [1] 0.556
##
## $K ## [1] 2.656  The function returns the value for Moran's $$I$$ and the sample kurtosis. Kurtosis is a measure of the peakedness of the distribution. A normal distribution has a kurtosis of 3. The value of 0.56 indicates fairly high spatial autocorrelation in cyclone intensity. This is expected since the strength of a cyclone as it moves from one hexagon to the next does not vary much and because stronger cyclones tend to occur at lower latitudes. ### Spatial lag variable Insight into the Moran's $$I$$ statistic is obtained by noting that it is equivalent to the slope coefficient in a regression model of $$Wy$$ on $$y$$, where $$Wy$$ is called the spatial lag variable. Let y be the set of cyclone intensities in each hexagon, then you create a spatial lag intensity variable using the lag.listw() function. y = hspdf$WmaxS
Wy = lag.listw(wts, y)


For each cyclone intensity value in the vector object y there is a corresponding value in the vector object Wy representing the mean intensity over the neighboring hexagons.

The neighbor average does not include the value in y, so for a completely connected hexagon the average is taken over the adjoining six neighboring values.

A scatter plot of the neighborhood average intensity versus the intensity in each hexagon shows the spatial autocorrelation relationship. The slope of a least-squares regression line through the points is the value of Moran's $$I$$.

Use the code below to create a Moran's scatter plot.

par(las = 1, pty = "s")
plot(y, Wy, pch = 20, xlab = "Intensity (m/s)", ylab = "Neighborhood Avg Intensity (m/s)")
abline(lm(Wy ~ y), lwd = 2, col = "red")
grid()


The plot indicates spatial autocorrelation. High intensity hexagons tend to be surrounded, on average, by other hexagons with high intensity and vice versa as evidenced by the upward slope of the regression line.

### Statistical significance

The expected value of Moran's $$I$$ under the hypothesis of no spatial autocorrelation is $E(I) = \frac{-1}{m-1}$ where $$m$$ is the number of hexagons. This allows you to test the significance of your sample Moran's $$I$$.

The test is available using the moran.test() function. The first argument is the vector of intensities and the second is the spatial weights matrix in the weights list form.

moran.test(y, wts)

##
##  Moran's I test under randomisation
##
## data:  y
## weights: wts
##
## Moran I statistic standard deviate = 7.436, p-value = 5.173e-14
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance
##          0.555961         -0.011765          0.005828


The output shows the standard deviate computed by taking the difference between the estimated $$I$$ and it's expected value under the null hypothesis of no autocorrelation.

The difference is divided by the square root of the difference between the variance of $$I$$ and the square of the mean of $$I$$. The $$p$$-value is the chance of observing a standard deviate this large or larger assuming there is no spatial autocorrelation.

The $$p$$-value is small leading you to conclude there is significant autocorrelation. The output also includes the value of Moran's $$I$$ along with its expected value and variance.

By default the variance of $$I$$ is computed by randomizing the intensities across the hexagon tiling. If the values are normally distributed, a version of the test that has more statistical power (higher probability that the test will reject the null hypothesis of no autocorrelation when there is, in fact, autocorrelation) by adding the argument random=FALSE.

Moran's $$I$$ and the corresponding significance test are sensitive to the definition of neighbors and to the neighborhood weights, so conclusions should be stated as conditional on your definition of neighborhoods.

## Spatial Regression Models

Spatial regression models take advantage of spatial autocorrelation. If significant autocorrelation exists, spatial regression models have parameters that are more stable and statistical tests that are more reliable than non-spatial alternatives.

Confidence intervals on a regression slope will have the proper coverage probabilities and prediction errors will be smaller. Autocorrelation is included in a regression model by adding a spatial lag dependent variable or by including a spatially correlated error term (Anselin et al. 2006).

Spatial autocorrelation can also enter a model indirectly by allowing the relationship between the response and the explanatory variable to vary across the domain. This is called geographically-weighted regression (Brunsdon et al. 1998, Fotheringham et al. 2000).

Geographically-weighted regression (GWR) allows you to see where an explanatory variable contributes strongly to the relationship and where it contributes weakly. It is similar to a local linear regression.

Here you compare a standard linear regression model of intensity on SST with a local linear regression of the same relationship by typing

x = hspdf$sst par(las = 1, pty = "s") plot(x, y, pch = 20, xlab = "SST (C)", ylab = "Intensity (m/s)") abline(lm(y ~ x), lwd = 2) lines(loess.smooth(x, y, span = 0.85), lwd = 2, col = "red")  With the local regression the relationship between intensity and SST changes for different values of SST. With lower values of SST the relationship has a gentle slope and with higher values of SST the relationship has a steeper slope. In contrast, the 'global' linear regression results in a single moderate slope across all values of SST. Fitting is done locally in the domain of SST. That is for a point $$s$$ along the SST axis, the fit is made using points in a neighborhood of $$s$$ weighted by their distance to $$s$$. The size of the neighborhood is controlled by the span argument. With span=.85 the neighborhood includes 85% of the SST values. With GWR this localization is done in the domain of geographic space. For example, a regression of cyclone intensity on SST is performed using the paired intensity and SST values at each of the 86 hexagons. For each hexagon the weight associated with a paired value in another hexagon is inversely proportional to physical distance between the two hexagons. In this way the relationship between intensity and SST is localized. ### Linear regression The standard regression model consists of a vector $$y$$ of response values and a matrix $$X$$ containing the set of explanatory variables plus a row vector of one's. The relationship is modeled as $y = X\beta+ \varepsilon$ where $$\beta$$ is a vector of regression coefficients and $$\varepsilon \sim N(0, \sigma^2I)$$ is a vector of independent and identically distributed residuals with variance $$\sigma^2$$. The maximum likelihood estimate of $$\beta$$ is given by $\hat \beta = (X^TX)^{-1}X^T y.$ You begin with a linear regression of cyclone intensity on SST across the set of hexagons. Although your interest is the relationship between intensity and SST you know that cyclone intensity within the hexagon will likely also depend on the number of cyclone hours. In general a hexagon with more cyclone hours will have a higher intensity. Thus your regression model includes SST and cyclone hours as explanatory variables in which case the SST coefficient from the regression is an estimate of the effect of SST on intensity after accounting for cyclone occurrence. You use the lm() function to create a linear regression model. The summary() method is used to obtain statistical information about the model. model = lm(WmaxS ~ sst + count, data = hspdf) summary(model)  ## ## Call: ## lm(formula = WmaxS ~ sst + count, data = hspdf) ## ## Residuals: ## Min 1Q Median 3Q Max ## -24.74 -11.12 0.29 9.11 33.46 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -11.2189 13.1775 -0.85 0.3970 ## sst 1.6999 0.5068 3.35 0.0012 ** ## count 0.1682 0.0544 3.09 0.0027 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 13.6 on 83 degrees of freedom ## Multiple R-squared: 0.274, Adjusted R-squared: 0.256 ## F-statistic: 15.6 on 2 and 83 DF, p-value: 1.74e-06  The formula indicates that the mean of WmaxS is related to sst and count. SST and cyclone hours are both important in statistically explaining cyclone intensity across the set of hexagons. The parameter on the SST variable is interpreted to mean that for every 1$$^\circ$$ C increase in sea-surface temperature cyclone intensity goes up by 1.7 m s$$^{-1}$$ after accounting for cyclone hours. The model explains 27.4% of the variation in cyclone intensity. These results represent the average relationship over the basin. ### Geographically-weighted regression The SST parameter is replaced by a vector of parameters in GWR one for each hexagon. The relationship between the response vector and the explanatory variables is expressed as $y = X\beta(g)+ \varepsilon$ where $$g$$ is a vector of geographic locations, here the set of hexagons with cyclone intensities and $\hat \beta(g) = (X^TWX)^{-1}X^TW y$ where $$W$$ is a weights matrix given by $W =\exp(-D^2/h^2)$ where $$D$$ is a matrix of pairwise distances between the hexagons and $$h$$ is the bandwidth. The elements of the weights matrix, $$w_{ij}$$, are proportional to the influence hexagons $$j$$ have on hexagons $$i$$ in determining the relationship between $$X$$ and $$y$$. Weights are determined by an inverse-distance function (kernel) so that values in nearby hexagons have greater influence on the local relationship compared with values in hexagons farther away. The bandwidth controls the amount of smoothing. It is chosen as a trade-off between variance and bias. A bandwidth too narrow (steep gradients on the kernel) results in large variations in the parameter estimates (large variance). A bandwidth too wide leads to a large bias as the parameter estimates are influenced by processes that do not represent the local conditions. Functions for selecting a bandwidth and running GWR are available in the spgwr package (Bivand and Yu 2011). First acquire the package and select the bandwidth using the gwr.sel() function. The first argument is the model formula and the second is the data frame. require(spgwr) bw = gwr.sel(WmaxS ~ sst + count, data = hspdf)  ## Bandwidth: 3421592 CV score: 15223 ## Bandwidth: 5530725 CV score: 16079 ## Bandwidth: 2118076 CV score: 13837 ## Bandwidth: 1312459 CV score: 12512 ## Bandwidth: 814560 CV score: 12297 ## Bandwidth: 831650 CV score: 12272 ## Bandwidth: 1009764 CV score: 12164 ## Bandwidth: 1125383 CV score: 12244 ## Bandwidth: 989429 CV score: 12162 ## Bandwidth: 984614 CV score: 12162 ## Bandwidth: 986176 CV score: 12162 ## Bandwidth: 986182 CV score: 12162 ## Bandwidth: 986181 CV score: 12162 ## Bandwidth: 986181 CV score: 12162 ## Bandwidth: 986181 CV score: 12162 ## Bandwidth: 986181 CV score: 12162 ## Bandwidth: 986181 CV score: 12162 ## Bandwidth: 986181 CV score: 12162  bw * 0.001  ## [1] 986.2  The procedure is an iterative optimization with improvements made based on previous values of the bandwidth. Values for the bandwidth and the cross-validation score are printed. After several iterations, no improvement in the score occurs. The bandwidth has dimensions of length representing a spatial distance. The units of the LCC projection in the spatial polygons data frame is meters so to convert the bandwidth to kilometers you multiply by 10$$^{-3}$$. GWR is performed using the gwr() function. The first two arguments are the same as in the function to select the bandwidth. The value of the bandwidth is supplied with the bandwidth argument. model = gwr(WmaxS ~ sst + count, data = hspdf, bandwidth = bw) model  ## Call: ## gwr(formula = WmaxS ~ sst + count, data = hspdf, bandwidth = bw) ## Kernel function: gwr.Gauss ## Fixed bandwidth: 986181 ## Summary of GWR coefficient estimates: ## Min. 1st Qu. Median 3rd Qu. Max. Global ## X.Intercept. -244.0000 -72.7000 5.5900 27.9000 62.3000 -11.22 ## sst -1.3100 0.1430 1.1600 4.0700 9.7700 1.70 ## count 0.0108 0.0936 0.1510 0.1860 0.2970 0.17  The output repeats the function call, which includes the form of the model, the kernel function (here Gaussian), and the bandwidth (units of meters). The output includes a summary of the regression parameter values across the hexagons. In general, intensity is positively related to SST, but the minimum parameter value indicates that in at least one hexagon the relationship is negative. The units on this regression parameter are m s$$^{-1}$$/$$^\circ$$ C. The summary also includes the parameter values from a standard regression under the column heading Global. These values are the same as output above using the lm() function. The object model$SDF inherits the spatial polygons data frame class from the model object along with the corresponding map projection. A summary method on this object provides additional information about the GWR.

summary(model$SDF)  ## Object of class SpatialPolygonsDataFrame ## Coordinates: ## min max ## x -4152910 4389727 ## y 1306230 5803139 ## Is projected: TRUE ## proj4string : ## [+proj=lcc +lat_1=60 +lat_2=30 +lon_0=-60 +ellps=WGS84] ## Data attributes: ## sum.w X.Intercept. sst count ## Min. : 4.94 Min. :-244.09 Min. :-1.310 Min. :0.0108 ## 1st Qu.:12.42 1st Qu.: -72.75 1st Qu.: 0.143 1st Qu.:0.0936 ## Median :15.66 Median : 5.59 Median : 1.165 Median :0.1511 ## Mean :15.20 Mean : -27.42 Mean : 2.194 Mean :0.1429 ## 3rd Qu.:18.73 3rd Qu.: 27.92 3rd Qu.: 4.067 3rd Qu.:0.1864 ## Max. :21.44 Max. : 62.27 Max. : 9.768 Max. :0.2974 ## gwr.e pred localR2 ## Min. :-27.228 Min. :23.2 Min. :0.234 ## 1st Qu.: -3.684 1st Qu.:31.1 1st Qu.:0.315 ## Median : 0.576 Median :38.3 Median :0.389 ## Mean : -0.204 Mean :40.5 Mean :0.416 ## 3rd Qu.: 6.210 3rd Qu.:49.1 3rd Qu.:0.504 ## Max. : 25.188 Max. :67.8 Max. :0.705  Here the coordinate bounding box is given along with the projection information. Data attributes are stored in the data slot. The attributes include the model coefficients (including the intercept term), the sum of the weights, predicted values, prediction errors, and R-squared values. A histogram of the R-squared values is made by typing hist(model$SDF$localR2, main = "", xlab = "Local R-Squared Values")  The values are centered near a value 0.5, but there are 0 hexagons with a value above 0.9. Additional insight is obtained by mapping the results. For instance, to make a choropleth map of the SST parameter values, type model = gwr(WmaxS ~ sst + count, data = hspdf, bandwidth = bw, hatmatrix = TRUE) p1 = spplot(model$SDF, "sst", col = "white", col.regions = blue2red(10), at = seq(-25,
25, 5), sp.layout = list(l2), colorkey = list(space = "bottom", labels = list(cex = 0.8)),
sub = list("SST effect on intensity [m/s/C]", cex = 0.8, font = 1))
model$SDF$sig = model$SDF$sst/model$SDF$sst_se
al = colorRampPalette(c("blue", "white", "red"), space = "Lab")
p2 = spplot(model$SDF, "sig", col = "white", col.regions = al(10), at = seq(-5, 5), sp.layout = list(l2), colorkey = list(space = "bottom", labels = list(cex = 0.8)), sub = list("Significance of effect ($t$-value)", cex = 0.8, font = 1)) plot(update(p1, main = textGrob("a", x = unit(0.05, "npc"))), split = c(1, 1, 1, 2), more = TRUE) plot(update(p2, main = textGrob("b", x = unit(0.05, "npc"))), split = c(1, 2, 1, 2), more = FALSE)  The marginal influence of SST on cyclone intensity is shown along with corresponding $$t$$ values. The $$t$$ value is calculated as the ratio of the parameter value to its standard error. Large values of $$|t|$$ indicate a statistical significant relationship. Standard errors are available by specifying the argument hatmatrix=TRUE. Hexagons are colored according to the SST coefficient value. The SST coefficient represents a local trend' of intensity as a function of SST holding cyclone occurrence constant. Hexagons with positive coefficients, indicating a direct relationship between cyclone strength and ocean warmth in m s$$^{-1}$$/$$^\circ$$ C, are displayed using red hues and those with negative coefficients are shown with blue hues. The divergent color ramp blue2red() creates the colors. Hexagons with the largest positive coefficients (greater than 5 m s$$^{-1}$$/$$^\circ$$ C) are found over the Caribbean Sea extending into the southeast Gulf of Mexico and east of the Lesser Antilles. Coefficients above zero extend over much of the Gulf of Mexico northeastward into the southwestern Atlantic. A region of coefficients less than zero is noted over the central North Atlantic extending from the middle Atlantic coast of the United States eastward across Bermuda and into the central Atlantic. Local statistical significance is estimated by dividing the coefficient value by its standard error. The ratio, called the $$t$$ value, is described by a $$t$$ distribution under the null hypothesis of a zero coefficient value. Regions of high $$t$$ values (absolute value greater than 2) denote areas of statistical significance and generally correspond with regions of large upward trends including most of the Caribbean sea and the eastern Gulf of Mexico. Regions with negative trends over the central Atlantic extending to the coastline fail to show significance. To some degree these results depend on the size of your hexagons (modifiable areal unit problem). One strategy is to rerun the GWR model with larger and smaller hexagons. In this way you can check how much the results change and whether the changes influence your conclusions. ### Model fit To assess model fit, you examine the residuals. Here a residual is the difference between observed and predicted intensity in each hexagon. Residuals above zero indicate the model under predicts the intensity and residuals below zero indicate the model over predicts intensity. Residuals are saved in model$SDF$gwr.e. A histogram of the residuals shows the values are centered on zero, but with a long left tail. The long left tail indicates the model anomalously over predicts intensity in some hexagons. rsd = model$SDF$gwr.e hist(rsd, main = "", xlab = "Residual (m/s)") hist(rsd[rsd <= -10], add = TRUE, col = "brown")  An interesting question concerns the location of these anomalous predictions. This is answered by coloring the hexagons having the largest negative residuals. First create a new vector for the spatial polygons data frame consisting of 1's indicating largest negative residuals and 0's indicating otherwise. Then use the spplot() method and restrict the color region argument to “brown” and “white” and set the color key to false. model$SDF$op = as.integer(rsd <= (-10)) spplot(model$SDF, "op", col = "white", col.regions = c("white", "brown"), sp.layout = list(l2),
colorkey = FALSE)


In general, the largest over predictions occur at low latitudes in hexagons near and over land. This makes sense; in these regions, although SST values are high, cyclone intensities are limited by influences associated with land like drier air and greater surface friction.

The map suggests a way the model can be improved. A factor indicating the presence or absence of land or a covariate indicating the proportion of hexagon covered by land would be a reasonable thing to try.

Other types of spatial regression models fit this hexagon framework. For instance if interest is cyclone counts then a Poisson spatial model Jagger et al. (2001) can be used. And if interest is prediction rather than explanation a model that includes spatial autocorrelation through a correlated error term or a lag response variable is possible.

Later we show you how to use the hexagon framework to construct a space-time model for hurricane occurrence.

## Spatial Interpolation

Spatial data often needs to be interpolated. Cyclone rainfall is a good example. You know how much it rained at observing sites but not everywhere across the affected region. Spatial interpolation estimates the rainfall values at arbitrary locations. Estimates on a regular grid are used to create contour maps.

This can be done in a various ways. Here we show you how to do it statistically. Statistical interpolation is preferable because it includes uncertainty estimates. The procedure is called 'kriging' after a mining engineer.

A kriged estimate (prediction) of some variable $$z$$ at a given location is a weighted average of the $$z$$ values over the entire domain where the weights are proportional to the spatial correlation. The estimates are optimal in the sense that they minimize the variance between the observed and interpolated values.

In short, kriging involves estimating and modeling the spatial autocorrelation and then using the model together with the observations to interpolate.

Here you work through an example using rainfall from tropical cyclone Fay in 2008. Fay formed near the Dominican Republic as a tropical wave, passed over the island of Hispaniola and Cuba before making landfall on the Florida Keys. Fay then crossed the Florida peninsula and moved westward across portions of the Florida panhandle producing heavy rains in parts of the state.

### Preliminaries

Here you use a spatial interpolation model to generate a continuous isohyet surface of rainfall totals from Fay. The data are in FayRain.txt and are a compilation of reports from the U.S. NOAA/NWS official weather sites and coop sites.

The coop sites are the Community Collaborative Rain, Hail and Snow Network (CoCoRaHS), a community-based, high density precipitation network made up of volunteers who take measurements of precipitation in their backyards. The data were obtained from NOAA/NCEP/HPC and from the Florida Climate Center.

You make use of functions in the gstat package (Pebesma 2004). Acquire the package and read in the data by typing

require(gstat)
con = "http://www.hurricaneclimate.com/storage/chapter-9/FayRain.txt"
names(FR)

## [1] "lon" "lat" "tpi" "tpm"


The data frame contains 803 rainfall sites. Longitude and latitude coordinates of the sites are given in the first two columns and total rainfall in inches and millimeters are given in the second two columns.

Create a spatial points data frame by specifying columns that contain the spatial coordinates. Then assign a geographic coordinate system and convert the rainfall from millimeters to centimeters.

coordinates(FR) = c("lon", "lat")
proj4string(FR) = CRS(ll2)
FR$tpm = FR$tpm/10
summary(FR$tpm)  ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 8.31 15.80 17.60 24.40 60.20  The median value is 15.8 cm and the maximum is 60.2 cm. Next import the Florida shapefile containing the county polygons. You use the readShapeSpatial() function from the maptools package. require(maptools) tmp = download.file("http://hurricaneclimate.com/storage/chapter-9/FL.zip", "FL.zip", mode = "wb") unzip("FL.zip") FLpoly = readShapeSpatial("FL/FL", proj4string = CRS(ll2))  Next create a character string specifying the tags for a planar projection and transform the geographic coordinates of the site locations and map polygons to the projected coordinates. Here you use Albers equal-area projection with true scale at latitudes 23 and 30$$^\circ$$ N and include a tag to specify meters as the unit of distance. require(rgdal) aea = "+proj=aea +lat_1=23 +lat_2=30 +lat_0=26, +lon_0=-83 +units=m" FR = spTransform(FR, CRS(aea)) FLpoly = spTransform(FLpoly, CRS(aea))  A map of the rainfall sites and storm totals with the state boundaries is made by typing l3 = list("sp.polygons", FLpoly, lwd = 0.3, first = FALSE) spplot(FR, "tpm", sp.layout = l3)  Two areas of extremely heavy rainfall are noted. One running north-south along the east coast and another across the north. Rainfall collection sites are clustered in urban areas. ### Empirical variogram Rainfall is an example of geostatistical data. In principle it can be measured anywhere, but typically you have values at a sample of sites. The pattern of sites is not of much interest as it is a consequence of constraints (convenience, opportunity, economics, etc) unrelated to the phenomenon. Your interest here centers on inference about how much rain fell across the region. This is done using kriging—the cornerstone of geostatistics (Cressie 1993). The tradition with kriging is to model the spatial correlation using a variogram rather than a correlation function (correlogram). The sample variogram is given as $\hat \gamma(h) = \frac{1}{2N(h)}\sum^{N(h)}_{i,j} (z_i -z_j)^2$ where $$N(h)$$ is the number of distinct pairs of observation sites a lag distance $$h$$ apart, and $$z_i$$ and $$z_j$$ are the rainfall totals at sites $$i$$ and $$j$$. Note that the $$h$$ is an approximate distance used with a lag tolerance $$\delta h$$. The assumption is that the rainfall field is stationary. This means that the relationship between rainfall at two locations depends only on the relative positions of the sites and not on where the sites are located. This relative position is refered to as lag distance. You further assume the variance of rainfall between sites depends only on their lag distance and not their orientation relative to one another (isotropy assumption). Smaller variances are expected between nearby sites compared with variances between more distant sites. The variogram is the inverse of the spatial correlation. You expect larger correlation in rainfall amounts between sites that are nearby and smaller correlation between sites farther apart. This means for instance that if a site has a large rain total, nearby sites will also tend to have large totals (small difference), but farther away the totals decrease. By definition $$\gamma(h)$$ is the semivariogram and $$2\gamma(h)$$ is the variogram. However, for conciseness $$\gamma(h)$$ is often referred to as the variogram. The variogram plots the semivariance as a function of lag distance. Since your rainfall values have units of centimeters, the units of the semivariance are cm$$^2$$. The empirical variogram is computed using the variogram() function. The first argument is the model formula specifying the rainfall column from the data frame and the second argument is the data frame name. Here ~ 1 in the model formula indicates no covariates or trends in the data. Trends can be included by specifying coordinate names (e.g., ~ lon + lat). Recall that although your data have a planar projection, the coordinate names are not changed from the original data set. You compute the empirical variogram for Fay's rainfall and save it by typing v = variogram(tpm ~ 1, data = FR)  To see the variogram values as a function of lag distance use the plot method on the variogram object. Or here you use the regular plot method and add text indicating the number of point pairs for each lag distance. par(las = 1) plot(v$dist/1000, v$gamma, xlim = c(0, 400), ylim = c(0, 220), xlab = "Lagged distance (h) [km]", ylab = expression(paste("Semivariance (", gamma, ") [", cm^2, "]")), las = 1, pch = 19) grid() points(v$dist/1000, v$gamma, pch = 19) text(v$dist/1000, v$gamma, pos = 1, labels = as.character(v$np), cex = 0.5)


Values start low (50 cm$$^2$$) at short lag distance, then increase to over 200 cm$$^2$$ at lag distance of about 200 km. Data are from 803 observations over Florida and adjacent regions.

The zero-lag semivariance is called the 'nugget' and the semivariance level where the variogram values no longer increase is called the 'sill.' The lag distance to the sill is called the 'range.' These three parameters (nugget, sill, and range) are used to fit a model to the variogram.

Semivariances are calculated using all pairs of rainfall observations within a lag distance (plus a lag tolerance). The number of pairs (indicated below each point on the plot) varies with lag distance. There are many pairs at short range but fewer at longer ranges.

You check on the assumption of isotropy by plotting variograms at various azimuths. For example, separate variograms are estimated for four directions (0, 45, 90, and 135$$^\circ$$) using the argument alpha, where 0$$^\circ$$ is north-south.

These directional variograms restrict observational pairs to those having similar relative orientation within an angle tolerance (think of a pie slice). If the directional variograms are similar, then the assumption of isotropy is likely valid.

### Variogram model

Next you fit a model to the empirical variogram. The variogram model is a mathematical relationship defining the semivariance as a function of lag distance.

There are several choices for model type (functions). To see the selection of functions as a trellis plot, type show.vgms(). The nugget is shown as an open circle. The Gaussian function ('Gau') appears reasonable for your empirical variogram since the variogram increases slowly at short lag distances then more rapidly at larger distances before leveling off at the sill.

You save the function and initial parameter values in a variogram model object by typing

vmi = vgm(model = "Gau", psill = 150, range = 200 * 1000, nugget = 50)


The psill argument is the partial sill as the difference between the sill and the nugget. You get eyeball estimates of the parameter values from your empirical variogram.

You then use the fit.variogram() function to fit the model to the empirical variogram. Specifically, given the Gaussian function together with initial parameter values, the default method of weighted least squares improves the parameter estimates.

Ordinary least squares is not appropriate as the semivariances are correlated across the lag distances and the precision on the estimates varies depending on the number of site pairs for a given lag.

v.fit = fit.variogram(v, vmi)
v.fit

##   model  psill  range
## 1   Nug  46.81      0
## 2   Gau 157.08 128666


The result is a variogram model with a nugget of 46.8 cm$$^2$$, a partial sill of 157.1 cm$$^2$$, and a range of 128.7 km.

You plot the variogram model on top of the empirical variogram by typing

plot(v$dist/1000, v$gamma, xlim = c(0, 400), ylim = c(0, 220), xlab = "Lag distance (h) [km]",
ylab = expression(paste("Semivariance (", gamma, ") [", cm^2, "]")), las = 1,
pch = 19)
grid()
nug = v.fit$psill[1] ps = v.fit$psill[2]
r = v.fit\$range[2]/1000
h = seq(0, 400, 0.2)
fit = ps * (1 - exp(-h^2/(r^2))) + nug
lines(h, fit, lwd = 2)


Let $$r$$ be the range, $$c$$ the partial sill and $$c_o$$ the nugget, then the equation defining the curve over the set of lag distances $$h$$ is $\gamma(h)=c\left(1-\exp\left(-\frac{h^2}{r^2}\right)\right)+c_o$

There are a variety of variogram functions. You can try the spherical function by replacing the model=“Gau” with model=“Sph” in the vgm() function above.

vmi2 = vgm(model = "Sph", psill = 150, range = 200 * 1000, nugget = 50)
v2.fit = fit.variogram(v, vmi2)
plot(v, v2.fit)


The geoR package contains the eyefit() function that can make choosing a function easier however, the interpolated values are not overly sensitive assuming a reasonable choice.

### Kriging

The final step is to use the variogram model together with the rainfall values at the observation sites to create an interpolated surface. The process is called kriging. As Edzer Pebesma notes, 'krige' is to 'kriging' as 'predict' is to 'predicting.' Here you use ordinary kriging as there are no spatial trends in the rainfall. Universal kriging is used when trends are present.

Interpolation is done using the krige() function. The first argument is the model specification and the second is the data. Two other arguments are needed. One is the variogram model using the argument name model and the other is a set of locations identifying where the interpolations are to be made. This is specified with the argument name newdata.

Here you interpolate first to locations (point kriging) on a regular grid and then to the county polygons (block kriging). To create a grid of locations within the boundary of Florida type

grd = spsample(FLpoly, n = 5000, type = "regular")


You specify the number of locations using the argument n. The actual number will be slightly different because of the irregular boundary. To view the locations use the plot method on the grid object.

Additional plotting options are available if the spatial points object is converted to a spatial pixels object.

grd = as(grd, "SpatialPixels")


First use the krige() function to interpolate (predict) the observed rainfall to the grid locations. For a given location, the interpolation is a weighted average of the rainfall across the entire region where the weights are determined by the variogram model.

ipl = krige(tpm ~ 1, FR, newdata = grd, model = v.fit)

## [using ordinary kriging]


The function recognizes the type of kriging being performed. For instance, if the variogram model is not included then inverse distance weighted interpolation is performed. The function will not work if different values share the same location.

The saved object (ipl) inherits the spatial pixels object specified in the newdata argument, but extends it to a spatial pixels data frame by adding a data slot. The data slot is a data frame with two variables. The first var1.pred is the interpolated rainfall and the second var1.var is the prediction variance.

You plot the interpolated field using the spplot method. You specify an inverted topographic color ramp to highlight in blue regions with the highest rain totals.

spplot(ipl, "var1.pred", col.regions = rev(topo.colors(20)), sp.layout = l3)