\chapter{Data Sets}
\label{chap:datasets}
\SweaveOpts{keep.source=TRUE, pdf=FALSE, prefix.string=Chap06, grdevice=tikz.Swd}
<>=
options(digits=3, show.signif.stars=FALSE, width=53)
rm(list=ls())
require(tikzDevice)
source("../SweaveTikZ.R")
@
%data files: read.csv("BTflat.csv"); read.table("gridboxes.txt"); read.csv("HS.csv", row.names=1, header=TRUE); load("landGrid.RData"); load("best.RData"); load("best.use.RData"); open.ncdf("sstna.nc")
%packages: lubridate, maps, ggplot2, survival, ncdf
%source code: source("interpolate.R"); source("savgol.R"); source("datasupport.R"); source("getmax.R"); source("getTracks.R")
%third party: NONE
\begin{quote}
``Data, data, data, I cannot make bricks without clay.''
\end{quote}
\indent---Sherlock Holmes\\
Hurricane data originate from careful analysis by operational meteorologists. The data include estimates of the hurricane position and intensity at six-hourly intervals. Information related to landfall time, local wind speeds, damages, and deaths, as well as cyclone size, is included. The data are archived by season. Effort is needed to make the data useful for climate studies.

In this chapter we describe the data sets used throughout this book. We show you a workflow that includes import into R, interpolation, smoothing, and adding additional attributes. We show you how to create useful subsets of the data. The code in this chapter is more complicated and can take longer to run. You can skip this material on first reading and continue with model building starting in Chapter~\ref{chap:frequencymodels}. You can return when you need an updated version of the data that includes the most recent years.

\section{Best-Tracks}

Most statistical models in this book use the best-track data. Here we describe the data and provide original source material. We also explain how to smooth and interpolate them. Interpolations and derivatives are needed for regional analysis.

\subsection{Description}

The {\it best-track} data set contains the six-hourly center locations and intensities of all known tropical cyclones across the North Atlantic basin, including the Gulf of Mexico and Caribbean Sea. The data set is called HURDAT for HURricane DATa. It is maintained by the U.S. National Oceanic and Atmospheric Administration (NOAA) at the National Hurricane Center (NHC). Center locations are given in geographic coordinates (in tenths of degrees). The intensities, representing the one-minute near-surface ($\sim$10 m) wind speeds, are given in knots (1 kt = 0.5144 m~s$^{-1}$), and the minimum central pressures are given in millibars (1 mb = 1~hPa). The data are provided in six-hourly intervals starting at 00 UTC (Coordinated Universal Time).

The version of the HURDAT file used here contains cyclones over the period 1851 through 2010, inclusive.\footnote{From \url{www.nhc.noaa.gov/pastall.shtml#hurdat}, August 2011.} Information on the history and origin of these data is found in \cite{JarvinenEtAl1984}. The file has a logical structure that makes it easy to read with FORTRAN. Each cyclone contains a header record, a series of data records, and a trailer record. Original best-track data for the first cyclone in the file are shown below.
\begin{scriptsize}
\begin{verbatim}
00005 06/25/1851 M= 4 1 SNBR= 1 NOT NAMED XING=1 SSS=1
00010 06/25*280 948 80 0*280 954 80 0*280 960 80 0*281 965 80 0*
00015 06/26*282 970 70 0*283 976 60 0*284 983 60 0*286 989 50 0*
00020 06/27*290 994 50 0*295 998 40 0*3001000 40 0*3051001 40 0*
00025 06/28*3101002 40 0* 0 0 0 0* 0 0 0 0* 0 0 0 0*
00030 HRBTX1
\end{verbatim}
\end{scriptsize}
The header (beginning with 00005) and trailer (beginning with 00030) records are single rows. The header has eight fields. The first field is the line number in intervals of five and padded with leading zeros. The second is the start day for the cyclone in MM/DD/YYYY format. The third is \verb@M= 4@ indicating four data records to follow before the trailer record. The fourth field is a number indicating the cyclone sequence for the season; here 1 indicates the first cyclone of 1851. The fifth field, beginning with \verb@SNBR=@, is the cyclone number over all cyclones and all seasons. The sixth field is the cyclone name. Cyclones were named beginning in 1950. The seventh field indicates whether the cyclone hit the United States, with \verb@XING=1@ indicating it did and \verb@XING=0@ indicating it did not. A hit is defined as the center of the cyclone crossing the coast of the continental United States as a tropical storm or hurricane. The final field indicates the Saffir-Simpson hurricane scale (1--5) impact in the United States based on the estimated maximum sustained winds at the coast. The value 0 was used to indicate U.S. tropical storm landfalls, but has been deprecated.

The next four rows contain the data records. Each row has the same format. The first field is again the line number. The second field is the cyclone day in MM/DD format. The next 16 fields are divided into four blocks of four fields each. The first block is the 00 UTC record and the next three blocks are in six-hour increments (6, 12, and 18 UTC). Each block is the same and begins with a code indicating the stage of the cyclone: tropical cyclone \verb@*@, subtropical cyclone \verb@S@, extratropical low \verb@E@, wave \verb@W@, and remnant low \verb@L@. The three digits immediately to the right of the cyclone stage code are the latitude of the center position in tenths of a degree north (\verb@280@ is 28.0$^\circ$N) and the next four digits are the longitude in tenths of a degree west (\verb@948@ is 94.8$^\circ$W) followed by a space. The third set of three digits is the maximum sustained (one minute) surface (10 m) wind speed in knots. These are estimated to the nearest 10 kt for cyclones prior to 1886 and to 5 kt afterwards. The final four digits after another space are the central surface pressure of the cyclone in mb, if available. If not, the field is set to zero. Central pressures are available for all cyclones after 1978.

The trailer has at least two fields. The first field is again the line number. The second field is the maximum intensity of the cyclone as a code using \verb@HR@ for hurricane, \verb@TS@ for tropical storm, and \verb@SS@ for subtropical storm. If there are additional fields, they relate to landfall in the United States. The fields are given in groups of four with the first three indicating location by state and the last indicating the Saffir-Simpson scale based on wind speeds in the state. Two-letter state abbreviations are used with the exception of Texas and Florida, which are further subdivided as follows: \verb@ATX@, \verb@BTX@, \verb@CTX@ for south, central, and north Texas, respectively, and \verb@AFL@, \verb@BFL@, \verb@CFL@, and \verb@DFL@ for northwest, southwest, southeast, and northeast Florida, respectively. An \verb@I@ is used as a prefix in cases where a cyclone has had a hurricane impact across a non-coastal state.
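The fixed layout of the data records means a six-hourly block can also be parsed directly in R. The following is a minimal sketch based on the field widths described above; the exact column positions are inferred from that description and the block string is hypothetical.
<>=
# A hedged sketch of parsing one six-hourly block: stage code, then a
# 3-digit latitude, a 4-digit longitude, a space, a 3-digit wind
# speed, a space, and a 4-digit pressure. The block is hypothetical.
blk = "*280 948  80    0"
stage = substring(blk, 1, 1)
lat = as.numeric(substring(blk, 2, 4))/10   # degrees N
lon = as.numeric(substring(blk, 5, 8))/10   # degrees W
wmax = as.numeric(substring(blk, 10, 12))   # kt
pmin = as.numeric(substring(blk, 14, 17))   # mb (0 if missing)
@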
\subsection{Import}

The HURDAT file (e.g., {\it tracks.txt}) is appended each year with the set of cyclones from the previous season. The latest version is usually available by late spring or early summer from \url{www.nhc.noaa.gov/pastall.shtml}. Additional modifications to older cyclones are made when newer information becomes available.

After downloading the HURDAT file we use a FORTRAN executable for the Windows platform ({\bf BT2flat.exe}) to create a flat file ({\it BTflat.csv}) listing the data records. The file is created by typing
\begin{verbatim}
BT2flat.exe tracks.txt > BTflat.csv
\end{verbatim}
The resulting comma-separated flat file is read into R, and the lines between the separate cyclone records are removed, by typing
<>=
best = read.csv("BTflat.csv")
best = best[!is.na(best[, 1]),]
@
Further adjustments are made to convert the hours from the HHMM format to whole hours and the longitudes to degrees east, and to rename the column indicating the type of cyclone.
<>=
best$hr = best$hr/100
best$lon = -best$lon
east = best$lon < -180
best$lon[east] = 360 + best$lon[east]
names(best)[12] = "Type"
@
The first six lines of the data frame are shown here (\verb@head(best)@).
\begin{small}
\begin{verbatim}
  SYear Sn      name   Yr Mo Da hr  lat   lon Wmax pmin Type
1  1851  1 NOT NAMED 1851  6 25  0 28.0 -94.8   80    0    *
2  1851  1 NOT NAMED 1851  6 25  6 28.0 -95.4   80    0    *
3  1851  1 NOT NAMED 1851  6 25 12 28.0 -96.0   80    0    *
4  1851  1 NOT NAMED 1851  6 25 18 28.1 -96.5   80    0    *
5  1851  1 NOT NAMED 1851  6 26  0 28.2 -97.0   70    0    *
6  1851  1 NOT NAMED 1851  6 26  6 28.3 -97.6   60    0    *
\end{verbatim}
\end{small}
Note the 10~kt precision on the {\tt Wmax} column. This is reduced to 5~kt from 1886 onward.

Unique cyclones in the data frame are identified by \verb@SYear@ and \verb@Sn@, but not by a single column identifier. To make it easier to subset by cyclone you add one as follows. First, use the \verb@paste@ function to create a character id string that combines the two columns. Second, table the number of cyclone records with each character id and save these as an integer vector (\verb@nrs@). Third, create a structured vector indexing the cyclones beginning with the first one. Fourth, repeat the index by the number of records in each cyclone and save the result in a \verb@Sid@ vector.
<>=
id = paste(best$SYear, format(best$Sn), sep = ":")
nrs = as.vector(table(id))
cycn = 1:length(nrs)
Sid = rep(cycn, nrs[cycn])
@
%Sre = unlist(sapply(nrs, seq))
Next you create a column identifying the cyclone hours. This is needed to perform time interpolations. Begin by creating a character vector with strings identifying the year, month, day, and hour. Note that you first need to account for cyclones that crossed into a new calendar year; in the best-track file the year remains the year of the season. The character vector is turned into a POSIXlt object with the \verb@strptime@ function (see Chapter~\ref{chap:graphsandmaps}) with the time zone argument set to GMT (UTC).
<>=
yrs = best$Yr
mos = best$Mo
yrs[mos==1] = yrs[mos==1] + 1
dtc = paste(yrs, "-", mos, "-", best$Da, " ", best$hr,
  ":00:00", sep="")
dt = strptime(dtc, format="%Y-%m-%d %H:%M:%S", tz="GMT")
@
Each cyclone record begins at either 0, 6, 12, or 18 UTC. Retrieve those hours for each cyclone using the \verb@cumsum@ function and the number of cyclone records as an index. Offsets are needed for the first and last cyclone. Then subsample the time vector obtained above at the corresponding values of the index and populate those times for all cyclone records. The cyclone hour is then the time difference between the two time vectors in units of hours, saved as a numeric vector \verb@Shour@.
<>=
i0 = c(1, cumsum(nrs[-length(nrs)]) + 1)
dt0 = dt[i0]
dt1 = rep(dt0, nrs[cycn])
Shour = as.vector(difftime(dt, dt1, units="hours"))
@
Finally, include the two new columns in the {\tt best} data frame.
<>=
best$Sid = Sid
best$Shour = Shour
dim(best)
@
The best-track data provide information on \Sexpr{length(cycn)} individual tropical cyclones over the period 1851--2010, inclusive. The data frame you created contains these data in \Sexpr{dim(best)[1]} separate six-hourly records each having \Sexpr{dim(best)[2]} columns. You can output the data as a spreadsheet using the \verb@write.table@ function. If you want to send the file to someone who uses R or load it into another R session, use the \verb@save@ function to export it. This exports a binary file that is imported back using the \verb@load@ function.
<>=
save(best, file="best.RData")
load("best.RData")
@
Alternatively you might be interested in the functions available in the {\bf RNetCDF} and {\bf ncdf} packages for exporting data in Network Common Data Form.

\subsection{Intensification}

You add value to the data by computing intensification (and decay) rates. The rate of change is estimated with a numerical derivative. Here you use the Savitzky-Golay smoothing filter \citep{SavitzkyGolay1964}, which is designed for calculating derivatives. The filter preserves the maximum and minimum cyclone intensities. Moving averages damp the extremes, and derivatives estimated using finite differencing have larger errors.

The smoothed value of wind speed at a particular location is estimated using a local polynomial regression of degree three on a window of six values (including three locations before and two after). This gives a window width of 1.5 days. The daily intensification rate is the coefficient on the linear term of the regression divided by 0.25 since the data are given in quarter-day increments. A third-degree polynomial captures most of the fluctuation in cyclone intensity without over-fitting the random variations, and is consistent with the 5~kt precision of the raw wind speeds.

The functions are available in {\bf savgol.R}. Download the file from the book website and source it. Then use the function \verb@savgol.best@ on your {\tt best} data frame saving the results back in {\tt best}.
<>=
source("savgol.R")
best = savgol.best(best)
@
The result is an appended data frame with two new columns {\tt WmaxS} and \verb@DWmaxDt@ giving the filtered estimates of wind speed and intensification, respectively. The filtered speeds have units of knots to be consistent with the best-track winds and the intensification rates are in kt/hr.
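The local regression underlying the filter can be sketched directly. This is a minimal illustration, not the \verb@savgol.best@ implementation; the record number is hypothetical and the window must lie within a single cyclone.
<>=
# Fit a cubic to the six-value window around interior record i and
# read off the fitted value and slope at record i (time zero).
i = 10
win = (i - 3):(i + 2)             # three records before, i, two after
tday = (win - i) * 0.25           # time in days relative to record i
fit = lm(best$Wmax[win] ~ poly(tday, 3, raw=TRUE))
coef(fit)[1:2]                    # smoothed speed (kt), rate (kt/day)
@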
As a comparison of your filtered and raw wind speeds you look at the results from Hurricane Katrina of 2005. To make the code easier to follow, first save the rows corresponding to this cyclone in a separate object.
<>=
Kt = subset(best, SYear == 2005 & name == "KATRINA ")
@
Next you plot the raw wind speeds as points and then overlay the filtered winds as a red line.
<>=
plot(Kt$Shour, Kt$Wmax, pch=16, xlab="Cyclone Hour",
  ylab="Wind Speed (kt)")
lines(Kt$Shour, Kt$WmaxS, lwd=2, col="red")
@
The spaces in the name after \verb@KATRINA@ are important as the variable name is a fixed-length character vector. On average the difference between the filtered and raw wind speeds is \Sexpr{round(sum(abs(Kt$Wmax-Kt$WmaxS))/dim(Kt)[1]*.5144,2)}~m~s$^{-1}$, which is below the roundoff of 2.5 m~s$^{-1}$ used in the archived data set. Over the entire best-track data the difference is even less, averaging \Sexpr{round(sum(abs(best$Wmax-best$WmaxS))/dim(best)[1]*.5144,2)}~m~s$^{-1}$. Importantly for estimating rates of change, the filtered wind speeds capture the maximum cyclone intensity.
<>=
require(lubridate)
Kt = subset(best, SYear == 2005 & name == "KATRINA ")
genesis = ymd(paste(Kt$Yr[1], Kt$Mo[1], Kt$Da[1]))
diss = ymd(paste(Kt$Yr[length(Kt$Yr)], Kt$Mo[length(Kt$Yr)],
  Kt$Da[length(Kt$Yr)]))
@
Figure~\ref{fig:intensification} shows filtered (red) and raw (circles) wind speeds and intensification rates (green) for Hurricane Katrina of 2005. Cyclone genesis occurred at \Sexpr{Kt$hr[1]*100}~UTC on \Sexpr{wday(genesis, label=TRUE, abbr=FALSE)} \Sexpr{month(genesis, label=TRUE, abbr=FALSE)} \Sexpr{Kt$Da[1]}, \Sexpr{Kt$Yr[1]}. The cyclone lasted for \Sexpr{Kt$Shour[length(Kt$Shour)]} hours (\Sexpr{Kt$Shour[length(Kt$Shour)]/24} days) dissipating at \Sexpr{Kt$hr[length(Kt$Yr)]*100}~UTC on \Sexpr{wday(diss, label=TRUE, abbr=FALSE)} \Sexpr{month(diss, label=TRUE, abbr=FALSE)} \Sexpr{Kt$Da[length(Kt$Yr)]}, \Sexpr{Kt$Yr[length(Kt$Yr)]}. The maximum best-track wind speed of \Sexpr{round(Kt$Wmax[which.max(Kt$Wmax)]*.5144,1)}~m~s$^{-1}$ occurs at cyclone hour \Sexpr{Kt$Shour[which.max(Kt$Wmax)]}, which equals the smoothed wind speed value at that same hour.
\begin{figure}
\centering
<>=
Kt = subset(best, SYear == 2005 & name == "KATRINA ")
w = Kt$Wmax * .5144
wS = Kt$WmaxS * .5144
int = Kt$DWmaxDt * .5144 * 24
par(las=1, mar=c(5, 4, 4, 3) + 1, mgp=c(2, .4, 0), tcl=-.3)
plot(Kt$Shour, w, pch=16, xaxt="n", xlab="Cyclone hour",
  ylab="Wind speed [m s$^{-1}$]")
ind = seq(1, 31, 4)
axis(1, at=Kt$Shour[ind], label=Kt$Shour[ind])
lines(Kt$Shour, wS, col="red", lwd=2)
points(Kt$Shour, w, pch=16)
par(new=TRUE)
plot(Kt$Shour, int, type="l", col="green", xaxt="n", yaxt="n",
  xlab="", ylab="", lwd=2)
axis(4)
mtext("Intensification [m s$^{-1}$/day]", side=4, line=2.5, las=0)
legend("topleft", col=c("black", "red", "green"), lty=c(-1, 1, 1),
  pch=c(16,-1,-1), legend=c("Best-track speed", "Filtered speed",
  "Intensification"), bg="white", cex=.6)
@
\vspace{-1.5cm}
\caption{Hurricane Katrina data.}
\label{fig:intensification}
\end{figure}
<>=
Ka = subset(best, SYear == 1992 & name == "ANDREW ")
w = Ka$Wmax * .5144
wS = Ka$WmaxS * .5144
int = Ka$DWmaxDt * .5144 * 24
par(las=1, mar=c(5, 4, 4, 4))
plot(Ka$Shour, w, pch=16, xlab="Cyclone Hour",
  ylab=expression(paste("Wind Speed [m ", s^-1,"]")))
lines(Ka$Shour, wS, col="red", lwd=2)
par(new=TRUE)
plot(Ka$Shour, int, type="l", col="green", xaxt="n", yaxt="n",
  xlab="", ylab="", lwd=2)
axis(4)
mtext(expression(paste("Intensification Rate [m ", s^-1,"/day]")),
  side=4, line=2.5, las=0)
legend("topleft", col=c("black", "red", "green"), lty=c(-1, 1, 1),
  pch=c(16, -1, -1), legend=c("Best-Track Speed", "Filtered Speed",
  "Intensification"))
@

\subsection{Interpolation}

For each cyclone, the observations are six hours apart. For spatial analysis and modeling, this can be too coarse as the average forward motion of hurricanes is 6 m~s$^{-1}$ (12 kt). You therefore fill in the data using interpolation to one hour. You also add an indicator variable for whether or not the cyclone is over land.

The interpolation is done with splines. The spline preserves the values at the regular six-hour times and uses a piecewise polynomial to obtain values between times. For the cyclone positions the splines are done using spherical geometry. The functions are available in {\bf interpolate.R}.
<>=
source("interpolate.R")
load("landGrid.RData")
bi = Sys.time()
best.interp = interpolate.tracks(best, get.land=TRUE,
  createindex=TRUE)
ei = Sys.time()
@
Additionally, a land mask is used to determine whether the location is over land or water. Be patient, as the interpolation takes time to run. To see how much time, you save the time (\verb@Sys.time()@) before and after the interpolation and then take the difference in seconds.
<>=
round(difftime(ei, bi, units="secs"), 1)
@
The interpolation output is saved in the object \verb@best.interp@ as a list with two components: the data frame in a component called {\tt data} and the index in a component called {\tt index}. The index is the record number by cyclone. The data frame has \Sexpr{dim(best.interp$data)[1]} rows and \Sexpr{dim(best.interp$data)[2]} columns. Extra columns include information related to the spherical coordinates and whether the position is over land, as well as the cyclone's forward velocity (magnitude and direction). For instance, the forward velocities are in the column labeled {\tt maguv} in units of kt. To obtain the average speed in m~s$^{-1}$ over all cyclones, type
<>=
mean(best.interp$data$maguv) * .5144
@
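The spline component of the interpolation can be illustrated on a single cyclone. Here is a minimal sketch using the Katrina records extracted earlier; \verb@interpolate.tracks@ does this for every cyclone and also interpolates the positions with spherical geometry.
<>=
# Interpolate the smoothed wind speeds to one-hour intervals with a
# cubic spline; the values at the six-hourly times are preserved.
hourly = spline(Kt$Shour, Kt$WmaxS, xout=seq(0, max(Kt$Shour), by=1))
head(data.frame(Shour=hourly$x, WmaxS=hourly$y))
@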
Finally, you add a day of year (\verb@jd@) column giving the number of days since January 1st of each year. This is useful for examining intra-seasonal activity (see Chapter~\ref{chap:timeseriesmodels}). You use the \verb@ISOdate@ function on the year, month, day, and hour columns of the \verb@best.interp$data@ data frame, first creating a POSIXct object marking the start of each year.
<>=
x = best.interp$data
start = ISOdate(x$Yr, 1, 1, 0)
current = ISOdate(x$Yr, x$Mo, x$Da, x$hr)
jd = as.numeric(current - start, unit="days")
best.interp$data$jd = jd
rm(x)
@
The hourly North Atlantic cyclone data prepared in the above manner are available in the file {\it best.use.RData}. The data include the best-track six-hourly values plus the smoothed and interpolated values using the methods described here, as well as a few other quantities. The file is created by selecting specific columns from the interpolated data frame above. For example, type
<>=
best.use = best.interp$data[, c("Sid", "Sn", "SYear", "name",
  "Yr", "Mo", "Da", "hr", "lon", "lat", "Wmax", "WmaxS",
  "DWmaxDt", "Type", "Shour", "maguv", "diruv", "jd", "M")]
save(best.use, file="best.use.RData")
@
You input these data as a data frame and list the first six lines by typing
<>=
load("best.use.RData")
head(best.use)
@
The \verb@load@ function inputs an R object saved as a compressed file with the \verb@save@ function. The object name in your workspace is the file name without the \verb@.RData@.

Once the hurricane data are prepared in the manner described above, you can use functions to extract subsets of the data for particular applications. Here we consider a function to add regional information to the cyclone locations and another function to obtain the lifetime maximum intensity of each cyclone. These data sets are used in later (as well as some earlier) chapters.

\subsection{Regional activity}
\label{subsec:regionalactivity}

Information about a cyclone's absolute location is available through the geographic coordinates (latitude and longitude). It is convenient also to have relative location information specifying, for instance, whether the cyclone is within a pre-defined area. Here your interest is near-coastal cyclones, so you consider three U.S. regions: the Gulf coast, Florida, and the East coast. The regions are shown in Fig.~\ref{chap06:fig:coastalregions}. Boundaries are whole-number parallels and meridians. The areas are large enough to capture a sufficient number of cyclones, but not so large as to include many non-coastal strikes.
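Testing whether a cyclone location lies inside such an area is a simple point-in-box comparison. As a minimal sketch, with hypothetical whole-number bounds standing in for one region (the actual boundaries are read from {\it gridboxes.txt} below):
<>=
# Flag the hourly records falling inside a hypothetical box; the
# bounds here are placeholders, not the actual region boundaries.
inBox = best.use$lon >= -98 & best.use$lon <= -80 &
  best.use$lat >= 22 & best.use$lat <= 32
sum(inBox)
@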
\begin{figure}
\begin{center}
<>=
regions = read.table("gridboxes.txt")
require(maps)
#require(ggplot2)
states = c("texas", "louisiana", "mississippi", "alabama",
  "florida", "georgia", "south carolina", "north carolina",
  "virginia", "maryland", "delaware", "new jersey", "new york",
  "connecticut", "rhode island", "massachusetts",
  "new hampshire", "maine")
map(database="state", region=states,
  xlim=range(regions$V1, na.rm=TRUE),
  ylim=range(regions$V2, na.rm=TRUE), col="grey", resolution=0)
lines(regions$V1, regions$V2)
text(-92, 27, labels="Gulf coast", cex=.8)
text(-76.5, 27, labels="Florida", cex=.8)
text(-71.5, 37, labels="East coast", cex=.8)
box()
#p1 = ggplot(regions, aes(V1, V2)) +
#  borders("state", states) + theme_bw() +
#  ylab("Latitude [$^\\circ$N]") +
#  xlab("Longitude [$^\\circ$E]") +
#  geom_polygon(color="red", fill=NA) +
#  geom_text(aes(x1, y1, label=text, angle=c(0, 0, 0)), color="black",
#    data.frame(x1=c(-92, -75, -70), y1=c(25, 27, 37),
#    text=c("Gulf coast", "Florida", "East coast")), size=3)
#print(p1 +
#  opts(axis.title.x=theme_text(size=5),
#    axis.title.y=theme_text(size=5)))
@
\end{center}
\vspace{-1.5cm}
\caption{Coastal regions.}
\label{chap06:fig:coastalregions}
\end{figure}
Relative location is coded as a logical vector indicating whether or not the cyclone is inside. The three near-coastal regions are non-overlapping and you create one vector for each region. But it is also of interest to know whether the cyclone was in any of these areas or none of them. You create an additional vector indicating whether the cyclone was near the United States. The functions are in {\bf datasupport.R}. They include \verb@import.grid@, which inputs a text file defining the parallels and meridians of the near-coastal regions and adds a regional name, with the Gulf coast defined as region one, Florida as region two, the East coast as region three, and the entire coast as region four.
<>=
source("datasupport.R")
grid = import.grid("gridboxes.txt")
best.use = add.grid(data=best.use, grid=grid)
@

\subsection{Lifetime maximum intensity}

An important variable for understanding hurricane climate is lifetime maximum intensity. Lifetime refers to the time from genesis to dissipation, and lifetime maximum refers to the highest wind speed during this lifetime. The intensity value and the location where the lifetime maximum occurred are of general interest. Here you use the \verb@get.max@ function in the source file {\bf getmax.R}. To make it accessible to your workspace, type
<>=
source("getmax.R")
@
For example, to apply the function on the {\tt best.use} data frame using the default options \verb@idfield="Sid"@ and \verb@maxfield="Wmax"@, type
<>=
LMI.df = get.max(best.use)
@
List the values in several columns of the first six rows of the data frame, rounding numeric variables to one decimal place.
<>=
round(head(LMI.df)[c(1, 5:9, 12, 16)], 1)
@
The data frame \verb@LMI.df@ contains the same information as {\tt best.use} except here there is only one row per cyclone, as indicated by the cyclone id variable. The row contains the lifetime maximum intensity and the corresponding location and other attribute information for the cyclone at the time the maximum was {\it first} achieved. If a cyclone is at its lifetime maximum intensity for more than one hour, only the first hour information is saved.
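The core idea of \verb@get.max@ can be sketched in a couple of lines: split the records by cyclone and keep the first row at which the maximum wind is achieved (\verb@which.max@ returns the index of the first maximum). This is a minimal equivalent for the default fields, not the implementation in {\bf getmax.R}.
<>=
# Keep the first row at which each cyclone attains its maximum Wmax.
firstmax = function(x) x[which.max(x$Wmax), ]
LMI.check = do.call("rbind",
  lapply(split(best.use, best.use$Sid), firstmax))
@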
To subset the data frame of lifetime maximum intensities for cyclones of tropical storm intensity or stronger since 1967 and export the data frame as a text file, type
<>=
LMI.df = subset(LMI.df, Yr >= 1967 & WmaxS >= 34)
write.table(LMI.df, file="LMI.txt")
@

\subsection{Regional maximum intensity}

Here your interest is the cyclone's maximum intensity only when it is within a specified region (e.g., near the coast). You create a set of data frames arranged as a list that extracts the cyclone maximum within each of the regions defined in \S\ref{subsec:regionalactivity}. You start by defining the first and last year of interest and creating a structured vector of those years, inclusive.
<>=
firstYear = 1851
lastYear = 2010
sehur = firstYear:lastYear
@
These definitions make it easy for you to add additional seasons of data as they become available or to focus your analysis on data only over the most recent years. Next define a vector of region names and use the function \verb@get.max.flags@ ({\bf datasupport.R}) to generate the set of data frames saved in the list object \verb@max.regions@.
<>=
Regions = c("Basin", "Gulf", "Florida", "East", "US")
max.regions = get.max.flags(se=sehur, field="Wmax", rnames=Regions)
@
You view the structure of the resulting list with the \verb@str@ function. Here you specify only the highest level of the list by setting the argument \verb@max.level@ to one.
<>=
str(max.regions, max.level=1)
@
The object contains a list of five data frames with names corresponding to the regions defined above. Each data frame has the same \Sexpr{dim(max.regions$Basin)[2]} columns of data defined in {\tt best.use}, but the number of rows depends on the number of cyclones passing through. Note the \verb@Basin@ data frame contains all cyclones. To list the first six rows and several of the columns in the \verb@Gulf@ data frame, type
<>=
head(max.regions$Gulf[c(1:7, 11)])
@
Note how you treat \verb@max.regions$Gulf@ as a regular data frame although it is part of a list. The output indicates that the sixth Gulf cyclone in the record is the \Sexpr{max.regions$Gulf$Sid[6]}th cyclone in the best-track record (\verb@Sid@ column) and the \Sexpr{max.regions$Gulf$Sn[6]}st cyclone of the \Sexpr{max.regions$Gulf$SYear[6]} season. It has a maximum intensity of \Sexpr{round(max.regions$Gulf$Wmax[6]*.5144, 1)}~m~s$^{-1}$ while in the region. You export the data using the \verb@save@ function as before.
<>=
save("max.regions", file="max.regions.RData")
@

\subsection{Tracks by location}

Suppose you want to know only about hurricanes that have affected a particular location, or those that have affected several locations (e.g., San Juan, Miami, and Kingston). Hurricanes specific to a location can be extracted with functions in the source file {\bf getTracks.R}. To see how this works, load the {\it best.use} data and source the code.
<>=
load("best.use.RData")
source("getTracks.R")
@
The function is \verb@get.tracks@. It takes as input the longitude ($^\circ$E) and latitude of your location, along with a search radius (nautical miles) and the number of cyclones, and searches for tracks within this distance. It computes a score for each track with closer cyclones getting a higher score. Here you use it to find the five cyclones (default of at least tropical storm strength) that have come closest to the Norfolk Naval Air Station (NGU) (76.28$^\circ$W longitude and 36.93$^\circ$N latitude) during the period 1900 through 2010.
You save the location and a search radius of 100 nmi in a data frame. You also set the start and end years of your search and the number of cyclones, and then you call \verb@get.tracks@.
<>=
loc = data.frame(lon=-76.28, lat=36.93, R=100)
se = c(1900, 2010); Ns = 5
ngu = get.tracks(x=best.use, locations=loc, N=Ns, se=se)
names(ngu)
@
The output contains a list with four objects. The objects \verb@N@ and \verb@locations@ are the input parameters. The object \verb@SidDist@ identifies the cyclones captured by the input criteria together with their distances, listed from closest to farthest from NGU. The corresponding track attributes are given in the list object \verb@tracks@ with each component a data frame containing the individual cyclone attributes from \verb@best.use@. The tracks are listed in order of increasing distance. For example, \verb@ngu$SidDist[1]@ is the distance of the closest track and \verb@ngu$tracks[[1]]@ is the data frame corresponding to this track.

You plot the tracks on a map, reusing the code from Chapter~\ref{chap:graphsandmaps}. Here you use a gray scale on the track lines corresponding to a closeness ranking, with darker lines indicating a relatively closer track.
<>=
map("world", ylim=c(12, 60), xlim=c(-90, -50))
points(ngu$location[1, 1], ngu$location[1, 2], col="red", pch=19)
for(i in Ns:1){
  clr = gray((i - 1)/Ns)
  Lo = ngu$tracks[[i]]$lon
  La = ngu$tracks[[i]]$lat
  n = length(Lo)
  lines(Lo, La, lwd=2, col=clr)
  arrows(Lo[n - 1], La[n - 1], Lo[n], La[n], lwd=2, length=.1,
    col=clr)
}
box()
@
The results are shown in Fig.~\ref{fig:closesttracks} for NGU only and for NGU and Roosevelt Naval Air Station in Puerto Rico (NRR). Darker tracks indicate closer cyclones. The application is useful for cyclone-relative hurricane climatologies (see \cite{ScheitlinEtAl2010}).
\begin{figure}
\begin{center}
<>=
#Norfolk Naval Air Station [NGU] lon=-76.28, lat=36.93
#Roosevelt Naval Air Station [NRR] lon=-65.639, lat=18.25
se = c(1900, 2010); Ns = 5
loc1 = data.frame(lon=-76.28, lat=36.93, R=100)
ngu = get.tracks(x=best.use, locations=loc1, N=Ns, se=se)
loc2 = data.frame(lon=c(-76.28, -65.639), lat=c(36.93, 18.25), R=100)
ngunrr = get.tracks(x=best.use, locations=loc2, N=Ns, se=se)
par(mfrow=c(1, 2))
map("world", ylim=c(12, 60), xlim=c(-90, -50))
for(i in Ns:1){
  clr = gray((i - 1)/Ns)
  Lo = ngu$tracks[[i]]$lon
  La = ngu$tracks[[i]]$lat
  n = length(Lo)
  lines(Lo, La, lwd=2, col=clr)
  arrows(Lo[n - 1], La[n - 1], Lo[n], La[n], lwd=2, length=.1,
    col=clr)
}
points(ngu$location[1, 1], ngu$location[1, 2], col="red", pch=19)
box()
mtext("a", side=3, line=1, adj=0, cex=1.1)
map("world", ylim=c(12, 60), xlim=c(-90, -50))
for(i in Ns:1){
  clr = gray((i - 1)/Ns)
  Lo = ngunrr$tracks[[i]]$lon
  La = ngunrr$tracks[[i]]$lat
  n = length(Lo)
  lines(Lo, La, lwd=2, col=clr)
  arrows(Lo[n - 1], La[n - 1], Lo[n], La[n], lwd=2, length=.1,
    col=clr)
}
points(ngu$location[1, 1], ngu$location[1, 2], col="red", pch=19)
points(ngunrr$location[2, 1], ngunrr$location[2, 2], col="red",
  pch=19)
box()
mtext("b", side=3, line=1, adj=0, cex=1.1)
@
\end{center}
\vspace{-2cm}
\caption{Five closest cyclones. (a) NGU and (b) NRR and NGU.}
\label{fig:closesttracks}
\end{figure}

\subsection{Attributes by location}

Location-specific hurricane attributes are important for local surge and wind models. To extract these data you first determine the cyclone observations within a grid box centered on your location of interest. This is done using the \verb@inside.lonlat@ utility function ({\bf getTracks.R}). Here your location is NGU from above.
<>=
ins = inside.lonlat(best.use, lon=loc[1, 1], lat=loc[1, 2], r=100)
length(ins)
@
Your grid box size is determined by twice the value of the argument \verb@r@ in units of nautical miles. The box is square as the distances are computed on a great circle. The function returns a logical vector with length equal to the number of cyclone hours in \verb@best.use@. Next you subset the rows in \verb@best.use@ for values of \verb@TRUE@ in \verb@ins@.
<>=
ngu.use = best.use[ins, ]
@
Since your interest is cyclones of hurricane intensity, you further subset using \verb@WmaxS@.
<>=
ngu.use = subset(ngu.use, WmaxS >= 64)
length(unique(ngu.use$Sid))
@
There are \Sexpr{length(unique(ngu.use$Sid))} hurricanes passing through your grid box over the period of record. A similar subset is obtained using a latitude/longitude grid by typing
<>=
d = 1.5
lni = loc[1, 1]
lti = loc[1, 2]
ngu.use = subset(best.use, lat <= lti + d & lat >= lti - d &
  lon <= lni + d & lon >= lni - d & WmaxS >= 64)
@
Finally use your \verb@get.max@ function to select cyclone-specific attributes. For example, to determine the distribution of {\it minimum} translation speeds for all hurricanes in the grid and plot them with a histogram, you type
<>=
source("getmax.R")
ngu.use$maguv = -ngu.use$maguv
ngu.use1 = get.max(ngu.use, maxfield="maguv")
speed = -ngu.use1$maguv * .5144
hist(speed, las=1, xlab="Forward Speed (m/s)", main="")
@
Notice that you take the additive inverse of the speed since \verb@get.max@ finds maxima while your interest here is the minimum.
\begin{figure}
\begin{center}
<>=
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
hist(speed, xlab="Minimum translation speed [m s$^{-1}$]",
  ylab="Number of hurricanes", main="")
rug(speed)
@
\end{center}
\vspace{-.75cm}
\caption{Minimum per hurricane translation speed near NGU.}
\label{fig:minimumforwardspeed}
\end{figure}
Data of this type are fit to a parametric distribution or resampled as input to surge and wind field models (see Chapter~\ref{chap:impactmodels}).

\section{Annual Aggregation}

It is also useful to have cyclone data aggregated in time. Aggregation is usually done annually since hurricane occurrences have a strong seasonal cycle (see Chapter~\ref{chap:timeseriesmodels}). The annual aggregation makes it convenient to merge cyclone activity with monthly averaged climate data.

\subsection{Annual cyclone counts}
\label{subsec:annualcyclonecounts}

Annual cyclone counts are probably the most frequently analyzed hurricane climate data. Here you aggregate cyclone counts by year for the entire basin and the near-coastal regions defined in \S\ref{subsec:regionalactivity}. First, simplify the region names to their first letter using the \verb@substring@ function, making an exception for the U.S. region by changing it back to \verb@US@.
<>=
load("max.regions.RData")
names(max.regions) = substring(names(max.regions), 1, 1)
names(max.regions)[names(max.regions)=="U"] = "US"
@
This allows you to add the Saffir-Simpson category as a suffix to the names. The \verb@make.counts.basin@ function ({\bf datasupport.R}) performs the annual aggregation of counts by category and region with the \verb@max.regions@ list of data frames as the input and the years specified with the \verb@se@ argument.
<>=
source("datasupport.R")
sehur = 1851:2010
counts = make.counts.basin(max.regions, se=sehur, single=TRUE)
str(counts, list.len=5, vec.len=2)
@
The result is a data frame with columns labeled $X.n$ for $n=0, 1, \ldots, 5$, where $X$ indicates the region.
For example, the annual count of hurricanes affecting Florida is given in the column labeled \verb@F.1@. The start year is 1851 and the end year is 2010.

Here you create a two-by-two matrix of plots showing hurricane counts by year for the basin and for the U.S., Gulf coast, and Florida regions. Note how the \verb@with@ function allows you to use the column names with the \verb@plot@ method.
<>=
par(mfrow=c(2, 2))
with(counts, plot(Year, B.1, type="h", xlab="Year",
  ylab="Basin count"))
with(counts, plot(Year, US.1, type="h", xlab="Year",
  ylab="U.S. count"))
with(counts, plot(Year, G.1, type="h", xlab="Year",
  ylab="Gulf coast count"))
with(counts, plot(Year, F.1, type="h", xlab="Year",
  ylab="Florida count"))
@
\begin{figure}
\begin{center}
<>=
par(mfrow=c(2, 2), mex=.6, las=1, mgp=c(2, .4, 0), tcl=-.3)
with(counts, plot(Year, B.1, type="h", xlab="Year",
  ylab="Basin count"))
mtext("a", side=3, line=1, adj=0, cex=1.1)
with(counts, plot(Year, US.1, type="h", xlab="Year",
  ylab="U.S. count"))
mtext("b", side=3, line=1, adj=0, cex=1.1)
with(counts, plot(Year, G.1, type="h", xlab="Year",
  ylab="Gulf coast count"))
mtext("c", side=3, line=1, adj=0, cex=1.1)
with(counts, plot(Year, F.1, type="h", xlab="Year",
  ylab="Florida count"))
mtext("d", side=3, line=1, adj=0, cex=1.1)
@
\end{center}
\vspace{-.75cm}
\caption{Hurricane counts. (a) Basin, (b) U.S., (c) Gulf coast, and (d) Florida.}
\label{fig:annualcounts}
\end{figure}
The plots are shown in Fig.~\ref{fig:annualcounts}. Regional hurricane counts indicate no long-term trend, but the basin-wide counts show an active period beginning late in the 20th century. Some of this variation is related to fluctuations in climate as examined in Chapter~\ref{chap:frequencymodels}. Next the annually and regionally aggregated cyclone counts are merged with monthly and seasonal climate variables.

\subsection{Environmental variables}

The set of candidate variables is large. You narrow it down by considering what is known about hurricane climate. For example, it is well understood that ocean heat provides the fuel, a calm atmosphere provides a favorable environment, and the location and strength of the subtropical ridge provide the steering currents. Thus statistical models of hurricane counts should include covariates that index these climate factors, including sea-surface temperature (SST) as an indicator of oceanic heat content, El Ni\~no-Southern Oscillation (ENSO) as an indicator of vertical wind shear, and the North Atlantic Oscillation (NAO) as an indicator of steering flow.

Variations in solar activity might also influence hurricane activity. We speculate that an increase in solar ultraviolet (UV) radiation during periods of strong solar activity might suppress tropical cyclone intensity as the temperature near the tropopause will warm through absorption of the radiation by ozone, modulated by dynamic effects in the stratosphere \citep{ElsnerJagger2008}.

Thus you choose four climate variables: North Atlantic Ocean SST, the Southern Oscillation Index (SOI) as an indicator of ENSO, an index for the NAO, and sunspot numbers (SSN). Monthly values for these variables are obtained from the following sources.
\begin{itemize}
\item {\bf SST}: The SST variable is an area-weighted average ($^\circ$C) using values in 5$^\circ$ latitude-longitude grid boxes from the equator north to 70$^\circ$N latitude and spanning the North Atlantic Ocean \citep{EnfieldEtAl2001}.\footnote{From \url{www.esrl.noaa.gov/psd/data/correlation/amon.us.long.data}, November 2011.} Values in the grid boxes are from a global SST data set derived from the U.K. Met Office \citep{KaplanEtAl1998}.
\item {\bf SOI}: The SOI is the contemporaneous difference in monthly sea-level pressures between Tahiti ($T$) in the South Pacific Ocean and Darwin ($D$) in Australia ($T-D$) \citep{Trenberth1984}.\footnote{From \url{www.cgd.ucar.edu/cas/catalog/climind/SOI.signal.annstd.ascii}, November 2011.} The SOI is inversely correlated with equatorial eastern and central Pacific SST, so an El Ni\~no warm event is associated with negative values of the SOI.
\item {\bf NAO}: The NAO is the fluctuation in contemporaneous sea-level pressure differences between the Azores and Iceland. An index value for the NAO is calculated as the difference in monthly normalized pressures at Gibraltar and over Iceland \citep{JonesEtAl1997}.\footnote{From \url{www.cru.uea.ac.uk/~timo/datapages/naoi.htm}, November 2011.} The NAO index indicates the strength and position of the subtropical Azores/Bermuda High.
\item {\bf SSN}: The SSN variable is the Wolf sunspot number, a measure of the number of sunspots present on the surface of the sun. The numbers are produced by the Solar Influences Data Analysis Center (SIDC) of the World Data Center for the Sunspot Index at the Royal Observatory of Belgium and are available from NOAA's National Geophysical Data Center.\footnote{From \url{ftp.ngdc.noaa.gov/STP/SOLAR_DATA/SUNSPOT_NUMBERS/INTERNATIONAL/monthly/MONTHLY}, November 2011.}
\end{itemize}
You combine the above climate and solar variables by month (May through October) and season with the aggregate hurricane counts by year. You use the \verb@readCov@ ({\bf datasupport.R}) function to input the data. The file must have a column indicating the year and 12 or 13 additional columns indicating the months and perhaps an annual average. The argument \verb@miss@ specifies the missing-value code used in the file. The argument \verb@ma@ is for centering and scaling the values. The default is none; \verb@"c"@ centers, \verb@"cs"@ centers and scales, and \verb@"l"@ subtracts the values in the last row from the values in each column. To accommodate using the previous year's data for modeling the current year's cyclone counts, the resulting data frame is augmented with columns corresponding to a one-year shift of all months using the argument \verb@last=TRUE@. Column names for the previous year's months are appended with a \verb@.last@.
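The one-year shift can be sketched with a toy example. This illustrates the idea only; it is not the {\bf datasupport.R} implementation.
<>=
# Append a one-year-lagged copy of the monthly columns, matching
# on Year.
x = data.frame(Year=2000:2002, Jan=c(1.2, 0.8, -0.4),
  Feb=c(0.5, -1.1, 0.9))
x.last = x[match(x$Year - 1, x$Year), -1]
rownames(x.last) = rownames(x)   # align row names for cbind
names(x.last) = paste(names(x.last), "last", sep=".")
cbind(x, x.last)
@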
You input and organize all climate variables at once with the \verb@readClimate@ function. This helps you avoid copying intermediate objects to your workspace. Copy and paste the code to your R session.
<>=
readClimate = function(){
  sst = readCov("data/amon.us.long.mean.txt", header=FALSE,
    last=TRUE, miss=-99.990, ma="l", extrayear=TRUE)
  soi = readCov("data/soi_ncar.txt", last=TRUE, miss=-99.9,
    extrayear=TRUE)
  nao = readCov("data/nao_jones.txt", last=TRUE,
    miss=c(-99.99, 9999), extrayear=TRUE)
  ssn = readCov("data/sunspots.txt", header=TRUE, last=TRUE,
    extrayear=TRUE)
  return(list(sst=sst, soi=soi, nao=nao, ssn=ssn))
}
@
The list of data frames, one for each climate variable, is created by typing
<>=
climate = readClimate()
str(climate, max.level=1)
@
Each data frame has 25 columns (variables) corresponding to two sets of monthly values (current and previous year) plus a column of years. The number of rows (observations) in the data frames varies, with the NAO record being the longest, starting with the year 1821, although not all months in the earliest years have values. To list the first six rows and several of the columns in the \verb@nao@ data frame, type
<>=
head(climate$nao[c(1:2, 21:23)])
@
Note how \verb@climate$nao@ is treated as a regular data frame although it is part of a list.

The final step is to merge the climate data with the cyclone counts organized in \S\ref{subsec:annualcyclonecounts}. This is done by creating a single data frame of your climate variables. First, create a list of month names by climate variable. Here you consider only the months from May through October. You use August through October as a group for the SOI and SST variables, May and June as a group for the NAO variable, and September for the SSN variable.
<>=
months = list(
  soi=c("May", "Jun", "Jul", "Aug", "Sep", "Oct"),
  sst=c("May", "Jun", "Jul", "Aug", "Sep", "Oct"),
  ssn=c("May", "Jun", "Jul", "Aug", "Sep", "Oct"),
  nao=c("May", "Jun", "Jul", "Aug", "Sep", "Oct"))
monthsglm = list(
  soi=c("Aug", "Sep", "Oct"),
  sst=c("Aug", "Sep", "Oct"),
  ssn="Sep",
  nao=c("May","Jun"))
@
Next use the \verb@make.cov@ ({\bf datasupport.R}) function on the \verb@climate@ list, specifying the month list and the start and end years. Here you use the word `covariate' in the statistical sense to indicate a variable that is potentially predictive of cyclone activity. A covariate is also called an explanatory variable, an independent variable, or a predictor.
<>=
covariates = cbind(make.cov(data=climate, month=months,
  separate=TRUE, se=sehur),
  make.cov(data=climate, month=monthsglm, separate=FALSE,
    se=sehur)[-1])
@
The \verb@cbind@ function brings together the columns into a single data frame. The last six rows and a sequence of columns from the data frame are listed by typing
<>=
tail(covariates[seq(from=1, to=29, by=5)])
@
The columns are labeled $X.m$, where $X$ indicates the covariate (\verb@soi@, \verb@sst@, \verb@ssn@, and \verb@nao@) and $m$ indicates the month using a three-letter abbreviation with the first letter capitalized. Thus, for example, June values of the NAO index are in the column labeled \verb@nao.Jun@. The hurricane-season averaged covariate is also given in a column labeled without the month suffix. Season averages use August through October for SST and SOI, May and June for NAO, and September only for SSN.
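As a check on the seasonal averaging, you can reconstruct a season value from its monthly columns. This assumes the season column is a simple mean of the listed months, which may not match the \verb@make.cov@ implementation exactly.
<>=
# Compare the season SOI with the August-October mean of the
# monthly columns (assumes a simple mean is used).
with(covariates, head(cbind(soi, (soi.Aug + soi.Sep + soi.Oct)/3)))
@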
As you did with the counts, here you create a two-by-two plot matrix showing the seasonal-averaged climate and solar variables by year (Fig.~\ref{fig:climatevariables}).
<>=
par(mfrow=c(2, 2))
with(covariates, plot(Year, sst, type="l", xlab="Year",
  ylab="SST [C]"))
with(covariates, plot(Year, nao, type="l", xlab="Year",
  ylab="NAO [s.d.]"))
with(covariates, plot(Year, soi, type="l", xlab="Year",
  ylab="SOI [s.d.]"))
with(covariates, plot(Year, ssn, type="l", xlab="Year",
  ylab="Sunspot Count"))
@
\begin{figure}
\centering
<>=
par(mfrow=c(2, 2), mex=.6, las=1, mgp=c(2.7, .4, 0), tcl=-.3)
with(covariates, plot(Year, sst, type="l", xlab="Year",
  ylab="SST [$^\\circ$C]"))
grid()
with(covariates, lines(Year, sst, type="l"))
mtext("a", side=3, line=1, adj=0, cex=1.1)
with(covariates, plot(Year, nao, type="l", xlab="Year",
  ylab="NAO [s.d.]"))
grid()
with(covariates, lines(Year, nao, type="l"))
mtext("b", side=3, line=1, adj=0, cex=1.1)
with(covariates, plot(Year, soi, type="l", xlab="Year",
  ylab="SOI [s.d.]"))
grid()
with(covariates, lines(Year, soi, type="l"))
mtext("c", side=3, line=1, adj=0, cex=1.1)
with(covariates, plot(Year, ssn, type="l", xlab="Year",
  ylab="Sunspot number"))
grid()
with(covariates, lines(Year, ssn, type="l"))
mtext("d", side=3, line=1, adj=0, cex=1.1)
@
\vspace{0cm}
\caption{Climate variables. (a) SST, (b) NAO, (c) SOI, and (d) sunspots.}
\label{fig:climatevariables}
\end{figure}
The long-term warming trend in SST is quite pronounced, as is the cooling trend during the 1960s and 1970s. The NAO values show large year-to-year variations and a tendency for negative values during the early part of the 21st century. The SOI values also show large interannual variations. Sunspot numbers show a pronounced periodicity near 11 years (the solar cycle) related to changes in the solar dynamo.

Finally, you use the \verb@merge@ function to combine the \verb@counts@ and \verb@covariates@ data frames, merging on the variable \verb@Year@.
<>=
annual = merge(counts, covariates, by="Year")
save(annual, file="annual.RData")
@
The result is a single data frame with \Sexpr{dim(annual)[1]} rows and \Sexpr{dim(annual)[2]} columns. The rows correspond to separate years and the columns include the cyclone counts by Saffir-Simpson scale and the monthly covariates defined here. The data frame is exported to the file {\it annual.RData}.

\section{Coastal County Winds}

\subsection{Description}

Other hurricane data sets are available. County wind data are compiled in \cite{JarrellEtAl1992} from reports on hurricane experience levels for coastal counties from Texas to Maine. The data are coded by Saffir-Simpson category and are available as an Excel\texttrademark spreadsheet.\footnote{From \url{www.aoml.noaa.gov/hrd/hurdat/Data_Storm.html}, November 2011.}

The file consists of one row for each year and one column for each county. A cell contains a number if a tropical cyclone affected the county in a given year; otherwise it is left blank. The number is the Saffir-Simpson category. For example, a county with a value of 2 indicates that category two wind speeds were likely experienced somewhere in the county. If the number is inside parentheses, then the county received an indirect hit and the highest winds were likely at least one category weaker. Cells having multiple entries, separated by commas, indicate the county was affected by more than one hurricane that year.

The data set is originally constructed as follows. First, a Saffir-Simpson category is assigned to the hurricane at landfall based on central pressure and wind intensity estimates in the best-track data set.
Some subjectivity enters the assignment, particularly with hurricanes during earlier years of the 20th century and with storms moving inland over a sparsely populated area. Thus there is some ambiguity about the category for hurricanes with intensities near the category cutoff. The category is sometimes adjusted based on storm surge estimates, in which case the central pressure may not agree with the scale assignment. Beginning with the 1996 hurricane season, scale assignments are based solely on maximum winds.

Second, a determination is made about which coastal counties received direct and indirect hits. A direct hit is defined as the innermost core regions, or `eye,' moving over the county. Each hurricane is judged individually based on the available data, but a general rule of thumb is applied in cases of greater uncertainty. That is, a county is regarded as receiving a direct hit when all or part of the county falls within R to the left of a storm's landfall and 2R to the right (with respect to an observer at sea looking toward shore), where R is the radius to maximum winds, defined as the distance from the cyclone's center to the circumference of maximum winds around the center. The determination of an indirect hit is based on a hurricane's strength and size and on the configuration of the coastline. In general, counties on either side of the direct-hit zone that received hurricane-force winds or tides of at least 1--2~m above normal are considered indirect hits. Subjectivity is also necessary here because of storm paths and coastline geography.

Table~\ref{tab:windspeedranges} lists the possible cell entries for a given hurricane and our interpretations of the symbol in terms of the Saffir-Simpson category and wind speed range. The first column is the symbol used in Appendix C of \cite{JarrellEtAl1992}. The second column is the corresponding Saffir-Simpson scale range likely experienced somewhere in the county. The third column is the interpreted maximum sustained (1 min) near-surface (10 m) wind speed range (m~s$^{-1}$).
\begin{table}
\caption{\label{tab:windspeedranges} Data symbols and interpretation. The symbol is from Appendix C of \cite{JarrellEtAl1992}.}
\begin{center}
\begin{tabular}{ccc}
\hline
 & Saffir-Simpson & Wind Speed Range \\
Symbol & Range & (m~s$^{-1}$) \\
\hline
(1) & [0,1) & 33--42 \\
1 & [1,2) & 33--42 \\
(2) & [1,2) & 33--42 \\
2 & [2,3) & 42--50 \\
(3) & [1,3) & 33--50 \\
3 & [3,4) & 50--58 \\
(4) & [1,4) & 33--58 \\
4 & [4,5) & 58--69 \\
(5) & [1,5) & 33--69 \\
5 & [5,$\infty$) & 69--1000 \\
\hline
\end{tabular}
\end{center}
\end{table}
The data are incomplete in the sense that you have a range of wind speeds rather than a single estimate. In the statistical literature the data are called `interval censored.' Note that (1) is the same as 1 since they both indicate a cyclone with at least hurricane-force winds.

\subsection{Counts and magnitudes}

The raw data need to be organized. First, you remove characters that code for information not used. These include codes such as `W', `E', `*', and `\_'. Second, all combinations of multiple hit years, say (1, 2), are converted to (1), (2) for parsing. Once parsed, all table cells consist of character strings that are either blank or contain cyclone hit information separated by commas.
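These cleaning steps were done before import, but the same edits can be sketched with \verb@gsub@. The cell value and the placement of the unused codes here are hypothetical.
<>=
# A sketch of the cleaning: drop unused codes, then distribute the
# parentheses across multiple-hit entries like (1, 2).
cell = "(1, 2)W"
cell = gsub("[WE*_]", "", cell)
cell = gsub("\\((\\d), (\\d)\\)", "(\\1), (\\2)", cell)
cell
@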
The cleaned data are input to R with the first column used for row names by typing
<>=
cd = read.csv("HS.csv", row.names=1, header=TRUE)
@
The state row is removed and saved as a separate vector by typing
<>=
states = cd[1,]
cdf = cd[-1,]
cdf[c(1, 11), 1:4]
@
Rows one and eleven are printed so you can see the data frame structure. In 1900 these four southern Texas counties were not affected by a hurricane (blank row), but in 1910, Kenedy and Kleberg counties had a direct hit by a category two hurricane that was also felt indirectly in Cameron and Willacy counties.

Next you convert this data frame into a matrix object containing event counts and a list object for event magnitudes. First set up a definition table to convert the category data to wind speeds. Note the order is (1), $\ldots$, (5), 1, $\ldots$, 5. The column names \verb@time@ and \verb@time2@ are required for use with the \verb@Surv@ function to create a censored data type.
<>=
wt = data.frame(
  time = c(rep(33, 6), 42, 50, 58, 69),
  time2 = c(42, 42, 50, 58, 69, 42, 50, 58, 69, 1000))
rownames(wt) = c(paste("(", 1:5, ")", sep=""), paste(1:5))
@
Next expand the data frame into a matrix. Each entry of the matrix is a character vector. The vector has zero length for counties without a hurricane in a given year. For counties with a hurricane or hurricanes, the vector contains symbols as shown in Table~\ref{tab:windspeedranges}, one for each hurricane. This is done using \verb@apply@ and the \verb@strsplit@ function as follows.
<>=
pd = apply(cdf, c(1, 2), function(x)
  unlist(strsplit(gsub(" ", "", x), ",")))
@
Next extract a matrix of counts and generate a list of events, one for each county, along with the years required for matching with the covariate data. Note that the year is extracted from the names of the elements.
<>=
counts = apply(pd, c(1, 2), function(x) length(x[[1]]))
events = lapply(apply(pd, 2, unlist), function(x)
  data.frame(Year=as.numeric(substr(names(x), 1, 4)), Events=x,
    stringsAsFactors=FALSE))
@
Finally convert the events to wind speed intervals. You do this using the \verb@Surv@ function from the {\bf survival} package as follows.
<>=
require(survival)
winds = lapply(events, function(x) data.frame(Year=x$Year,
  W = do.call("Surv", c(wt[x$Events, ], list(type="interval2")))))
@
The object \verb@winds@ is a list over counties, with each component containing a data frame. To extract the data frame for county 57, corresponding to Miami-Dade county, type
<>=
miami = winds[[57]]
class(miami)
head(miami)
@
The data frame contains a numerical year variable and a categorical survival variable. The survival variable has three components giving the lower and upper wind speed bounds and the censoring type. You will use the \verb@winds@ and \verb@counts@ objects in Chapter~\ref{chap:intensitymodels} to create a probability model for winds exceeding various threshold levels. Here you export the objects as separate files using the \verb@save@ function so you can read them back using the \verb@load@ function.
<>=
save(winds, file="catwinds.RData")
save(counts, file="catcounts.RData")
@
The saved files are binary (8-bit characters) to ensure they transfer without converting end-of-line markers.

\section{NetCDF Files}

Spatial climate data like monthly SST grids are organized as arrays and stored in netCDF files. NetCDF (Network Common Data Form) is a set of software libraries and data formats from the Unidata community that support the creation, access, and sharing of data arrays.
The National Center for Atmospheric Research (NCAR) uses netCDF files to store large data sets. The {\bf ncdf} package \citep{Pierce2011} provides functions for working with netCDF files in R. Load the package by typing
<>=
require(ncdf)
@
You also might be interested in the functions available in {\bf RNetCDF} for processing netCDF files.

Here your interest is NOAA's extended reconstructed SST version 3b data set for the North Atlantic Ocean.\footnote{From \url{www.esrl.noaa.gov/psd/data/gridded/data.noaa.ersst.html}, November 2011.} The data are provided by the NOAA/OAR/ESRL PSD in Boulder, Colorado. The data are available in the file {\it sstna.nc} for the domain bounded by the equator and 70$^\circ$N latitude and 100$^\circ$W and 10$^\circ$E longitude for the months January 1854 through November 2009.

First, use the function \verb@open.ncdf@ to input the SST data.
<>=
nc = open.ncdf("sstna.nc")
@
Next, convert the \verb@nc@ object of class ncdf into a three-dimensional array and print the array's dimensions.
<>=
sstvar = nc$var$sst
ncdata = get.var.ncdf(nc, sstvar)
dim(ncdata)
object.size(ncdata)
@
The file contains \Sexpr{dim(ncdata)[1]*dim(ncdata)[2]*dim(ncdata)[3]} monthly SST values distributed across \Sexpr{dim(ncdata)[1]} longitudes, \Sexpr{dim(ncdata)[2]} latitudes, and \Sexpr{dim(ncdata)[3]} months.

Additional work is needed. First, extract the array dimensions as vector coordinates of longitudes, latitudes, and time. Then change the longitudes to negative west of the prime meridian and reverse the latitudes to increase from south to north. Also, convert the time coordinate to a POSIX time (see Chapter~\ref{chap:graphsandmaps}) using January 1, 1800 as the origin.
<>=
vals = lapply(nc$var$sst$dim, function(x) as.vector(x$vals))
vals[[1]] = (vals[[1]] + 180) %% 360 - 180
vals[[2]] = rev(vals[[2]])
timedate = as.POSIXlt(86400 * vals[[3]],
  origin=ISOdatetime(1800, 1, 1, 0, 0, 0, tz="GMT"), tz="GMT")
timecolumn = paste("Y", 1900 + timedate$year, "M",
  formatC(as.integer(timedate$mo + 1), 1, flag="0"), sep="")
names(vals) = sapply(nc$var$sst$dim, "[", "name")
vals = vals[1:2]
@
Note that the double percent symbol is the modulo operator, which finds the remainder of a division of the number to the left of the symbol by the number to the right. For example, a grid longitude of 260 (degrees east of the prime meridian) becomes $(260 + 180) \bmod 360 - 180 = -100$, or 100$^\circ$W.

Next coerce the array into a data frame with one column per time period and assign column names.
<>=
ncdata1 = ncdata[, (dim(ncdata)[2]:1), ]
dims = dim(ncdata1)
dim(ncdata1) = c(dims[1] * dims[2], dims[3])
colnames(ncdata1) = timecolumn
ncdata1 = as.data.frame(ncdata1)
ncdataframe = cbind(expand.grid(vals), ncdata1)
@
Then remove the rows having missing values, which correspond to land locations, and save the resulting data frame.
<>=
misbyrow = apply(ncdataframe, 1, function(x) sum(is.na(x)))
ncdataframe = ncdataframe[misbyrow==0, ]
save("ncdataframe", file="ncdataframe.RData")
@
Finally, to create a subset data frame with only July 2005 SST values on your latitude-longitude grid, type
<>=
sst = ncdataframe[paste("Y2005", "M", formatC(7, 1, flag="0"),
  sep="")]
names(sst) = "SST"
sst$lon = ncdataframe$lon
sst$lat = ncdataframe$lat
write.table(sst, file="sstJuly2005.txt")
@
This data frame is used in Chapters~\ref{chap:frequencymodels} and \ref{chap:spatialmodels}.

This chapter showed how to extract useful data sets from original source files. We began by showing how to create a spreadsheet-friendly flat file from the available best tracks. We showed how to add value by smoothing, interpolating, and computing derivative variables.
We also showed how to parse the data regionally and locally and how to create a subset based on the cyclone's lifetime maximum intensity. Next we demonstrated how to aggregate the data annually and merge the counts with relevant environmental variables. We then examined a coastal county wind data set and showed how to work with netCDF files. Part II focuses on using these data to analyze and model hurricane activity. We begin with models for hurricane frequency.