\chapter{Intensity Models} \label{chap:intensitymodels} \SweaveOpts{keep.source=TRUE, pdf=FALSE, prefix.string=Chap08, grdevice=tikz.Swd} <>= options(digits=3, show.signif.stars=FALSE, width=53) rm(list=ls()) require(tikzDevice) source("../SweaveTikZ.R") @ %data files: read.table("LMI.txt", T); read.table("SST.txt", T); read.table("SOI.txt", T); load("catwinds.RData"); load("catcounts.RData") %packages: quantreg, xtable, ismev, gamlss, gamlss.cens %source code: source("CountyWinds.R") %third party: NONE \begin{quote} ``We must think about what our models mean, regardless of fit, or we will promulgate nonsense.'' \end{quote} \indent---Leland Wilkinson\\ Often interest is not on the average, but on the strongest. Strong hurricanes, like Camille in 1969, Andrew in 1992, and Katrina in 2005, cause catastrophic damage. It's important to have an estimate of when the next big one will occur. You also might want to know what influences the strongest hurricanes and if they are getting stronger. This chapter shows you how to model hurricane intensity. The data are basin-wide lifetime highest intensities for individual tropical cyclones over the North Atlantic and county-level hurricane wind intervals. We begin by considering trends using the method of quantile regression and then examine extreme-value models for estimating return period of the strongest hurricanes. We also look at modeling cyclone winds when the values are given by category and use Miami-Dade County as an example. \section{Lifetime Highest Intensity} \label{sec:lifetimehighestintensity} Here we consider cyclones above tropical storm intensity ($\ge$ 17 m~s$^{-1}$) during the period 1967--2010, inclusive. The period is long enough to see changes but not too long that it includes intensity estimates before satellite observations. We use `intensity' and `strength' synonymously to mean the fastest wind inside the hurricane. \subsection{Exploratory analysis} Consider the set of events defined by the location and wind speed at which a tropical cyclone first reaches its lifetime maximum intensity (see Chapter~\ref{chap:graphsandmaps}). The data are in the file {\it LMI.txt}. Import and list the values in ten columns of the first six rows of the data frame by typing <>= LMI.df = read.table("LMI.txt", header=TRUE) round(head(LMI.df)[c(1, 5:9, 12, 16)], 1) @ The dataset is explained and described in Chapter~\ref{chap:datasets}. Here our interest is the smoothed intensity estimate at the time of lifetime maximum ({\tt WmaxS}). First convert the wind speeds from the operational units of knots to the SI units of m~s$^{-1}$. <>= LMI.df$WmaxS = LMI.df$WmaxS * .5144 @ Next, determine the quartiles (0.25 and 0.75 quantiles) of the wind speed distribution. The quartiles divide the cumulative distribution function (CDF) into three equal-sized subsets. <>= quantile(LMI.df$WmaxS, c(.25, .75)) @ We find that 25\% of the cyclones have a maximum wind speed less than \Sexpr{round(quantile(LMI.df$WmaxS,.25),0)}~m~s$^{-1}$ and 75\% have a maximum wind speed less than \Sexpr{round(quantile(LMI.df$WmaxS,.75),0)}~m~s$^{-1}$ so that 50\% of all cyclones have a maximum wind speed between \Sexpr{round(quantile(LMI.df$WmaxS,.25),0)} and \Sexpr{round(quantile(LMI.df$WmaxS,.75),0)}~m~s$^{-1}$ (interquartile range--IQR). Similarly, the quartiles (deciles) divide the sample of storm intensities into four (ten) groups with equal proportions of the sample in each group. The quantiles, or percentiles refer to the general case. The cumulative distribution function (CDF) gives the empirical probability of observing a value in the record less than a given wind speed maximum. The quantile function is the inverse of the CDF allowing you to determine the wind speed for specified quantiles. Both functions are monotonically nondecreasing. Thus, given a sample of maximum wind speeds $w_1,\ldots, w_n$, the $\tau$th sample quantile is the $\tau$th quantile of the corresponding empirical CDF. Formally, let $W$ be a random maximum storm intensity then the $k$th `$q$'-quantile is defined as the value `$w$' such that \begin{equation} p(W \le w) \ge \tau \hbox{ and } p(W \ge w) \ge 1 - \tau, \end{equation} where $\tau=\frac{k}{n}$. Figure~\ref{fig:ecdf:quant:lmi} shows the cumulative distribution and quantile functions for the \Sexpr{dim(LMI.df)[1]} tropical cyclone intensities in the data frame. The CDF appears to have three distinct regions, indicated by the vertical lines. The function is nearly a straight line for intensities less than 40 m~s$^{-1}$ and greater than 65 m~s$^{-1}$. \begin{figure} \centering <>= pt = seq(0, 1, .01) qp = quantile(LMI.df$WmaxS, prob=pt) par(mfrow=c(1, 2), las=1, mgp=c(2, .4, 0), tcl=-.3) plot(ecdf(LMI.df$WmaxS), xlab="Wind speed [m s$^{-1}$]", ylab="Cumulative distribution", main="") abline(v=40, col="red"); abline(v=65, col="red") rug(LMI.df$WmaxS) mtext("a", side=3, line=1, adj=0, cex=1.1) plot(pt, qp, type="l", xlab="Quantile", ylab="Wind speed [m s$^{-1}$]", lwd=2) mtext("b", side=3, line=1, adj=0, cex=1.1) @ \vspace{-.75cm} \caption{Fastest cyclone wind. (a) Cumulative distribution and (b) quantile.} \label{fig:ecdf:quant:lmi} \end{figure} Is there a trend in cyclone intensities? Start with a plot of your data. By specifying the first argument in the \verb@boxplot@ function as a model formula you create a sequence of conditional box plots. For example, to create a series of wind speed box plot conditional on year, type <>= boxplot(LMI.df$WmaxS ~ as.factor(LMI.df$SYear)) @ Note that the conditioning variable must be specified as a factor. The graph is useful for a quick look at the distribution of your wind speed data over time. Recall from Chapter~\ref{chap:graphsandmaps} we created a series of box plots of the SOI by month that minimized the amount of redundant ink. Here we reuse this code, modifying it a bit, to create a series of wind speed box plots by year. Begin by creating a vector of years and saving the length of the vector as a numeric object. <>= yrs = 1967:2010 n = length(yrs) @ Next create the plot frame without the data and without the horizontal axis tic labels. You also add a label to the vertical axis. <>= plot(c(1967, 2010), c(15, 85), type="n", xaxt="n", bty="n", xlab="", ylab="Lifetime maximum wind speed (m/s)") axis(1, at=yrs, labels=yrs, cex=.5) @ The function \verb@fivenum@ lists the minimum, first quartile, median, third quartile, and maximum value in that order, so to obtain the median value from a vector of values called \verb@x@, you type \verb@fivenum(x)[3]@. Thus you loop over each year indexed by \verb@i@ and plot the median wind speed value for that year as a point using the \verb@points@ function. In the same loop you create vertical lines connecting the minimum with the first quartile and the third quartile with the maximum using the \verb@lines@ function. <>= for(i in 1:n){ fn = fivenum(LMI.df$WmaxS[LMI.df$SYear == yrs[i]]) points(yrs[i], fn[3], pch=19) lines(c(yrs[i], yrs[i]), c(fn[1], fn[2])) lines(c(yrs[i], yrs[i]), c(fn[4], fn[5])) } @ Note the subset operator \verb@[@ is used to obtain wind speed values by year. The results are shown in Fig.~\ref{fig:box:lmi}. Here we added the least-squares regression line about the annual mean lifetime highest wind speed (black line) and the least-squares regression line about the annual lifetime highest wind speed (red). While there is no upward or downward trend in the average cyclone intensity, there is an upward trend to the set of strongest cyclones. \begin{figure} \centering <>= par(las=1, mgp=c(2, .4, 0), tcl=-.3) maxw = numeric() plot(c(1967, 2010), c(15, 85), type="n", xaxt="n", bty="n", xlab="Year", ylab="Wind speed [m s$^{-1}$]") axis(1, at=yrs, labels=yrs, cex=.5) for(i in 1:n){ fn = fivenum(LMI.df$WmaxS[LMI.df$SYear == yrs[i]]) points(yrs[i], fn[3], pch=19) lines(c(yrs[i], yrs[i]), c(fn[1], fn[2])) lines(c(yrs[i], yrs[i]), c(fn[4], fn[5])) maxw[i] = fn[5] } abline(lm(maxw ~ yrs), col="red", lwd=2) abline(lm(WmaxS ~ SYear, data=LMI.df), lwd=2) @ \vspace{-.75cm} \caption{Lifetime highest wind speeds by year.} \label{fig:box:lmi} \end{figure} The theory of maximum potential intensity, which relates intensity to ocean heat, refers to a theoretical limit given the thermodynamic conditions \citep{Emanuel1988}. So the upward trend in the observed lifetime maximum intensity is physically consistent with what you expect given the increasing ocean temperature. It is informative then to explore the relationship of lifetime highest wind speed to ocean temperature. In Chapter~\ref{chap:Rtutorial} you imported the monthly North Atlantic sea-surface temperature (SST) data by typing <>= SST = read.table("SST.txt", header=TRUE) @ Here you subset the data using years since 1967, and keep June values only. <>= lg = SST$Year >= 1967 sst.df = data.frame(Yr=SST$Year[lg], sst=SST$Jun[lg]) @ Next merge your SST data frame with your cyclone intensity data. This is done using the \verb@merge@ function. Merge is performed on the common column name \verb@Yr@ as specified with the \verb@by@ argument. <>= lmisst.df = merge(LMI.df, sst.df, by="Yr") head(lmisst.df[c("Yr", "WmaxS", "sst")]) @ Note that since there are more instances of {\tt Yr} in the intensity data frame (one for each cyclone), the June SST values in the SST data frame get duplicated for each instance. Thus all cyclones for a particular year get the same SST value as it should be. You are interested in regressing cyclone intensity on SST as you did above on the year, but the SST values are continuous rather than discrete. So you first create SST intervals. This is done with the \verb@cut@ function. <>= brk = quantile(lmisst.df$sst, prob=seq(0, 1, .2)) sst.i = cut(lmisst.df$sst, brk, include.lowest=TRUE) @ Your cuts divide the SST values into five equal quantiles (pentiles). The intervals represent categories of much below normal, below normal, normal, above normal, and much above normal SST. The choice of quantiles is a compromise between having enough years for a given range of SSTs and having enough quantiles to assess differences. You repeat this procedure for your SOI data. You create a merged data frame and cut the SOI values into pentads. <>= SOI = read.table("SOI.txt", header=TRUE) lg = SOI$Year >= 1967 soi.df = data.frame(Yr=SOI$Year[lg], soi=SOI$Sep[lg]) lmisoi.df = merge(LMI.df, soi.df, by="Yr") brk = quantile(lmisoi.df$soi, prob=seq(0, 1, .2)) soi.i = cut(lmisoi.df$soi, brk, include.lowest=TRUE) @ Finally, you create a series of box plots corresponding to the SST intervals. This time you use the \verb@boxplot@ function as described above. Begin by creating a character vector of horizontal axis labels corresponding to the SST intervals and, to simplify the code, save the wind speeds as a vector. <>= xlabs = c("MB", "B", "N", "A", "MA") W = lmisst.df$WmaxS @ You then save the output from a call to the \verb@boxplot@ function, making sure to turn off the plotting option. <>= y = boxplot(W ~ sst.i, plot=FALSE) @ Initiate the plot again and add regression lines through the medians and third quartile values using the saved statistics of the box plot and regressing on the sequence from one to five. <>= boxplot(W ~ sst.i, notch=TRUE, names=xlabs, xlab="June SST Quantiles", ylab="Lifetime Maximum Wind Speed (m/s)") x = 1:5 abline(lm(y$stats[3, ] ~ x), lwd=2) abline(lm(y$stats[4, ] ~ x), col="red", lwd=2) @ The results are shown in Fig.~\ref{fig:box:lmisstsoi}. Here we repeat the code using the September SOI covariate and create two box plots. \begin{figure} \centering <>= par(mfrow=c(1, 2), las=1, mgp=c(2, .4, 0), tcl=-.3) y = boxplot(W ~ sst.i, plot=FALSE) boxplot(W ~ sst.i, notch=TRUE, names=xlabs, xlab="June SST pentile", ylab="Wind speed [m s$^{-1}$]", cex.axis=.7) x = 1:5 abline(lm(y$stats[3, ] ~ x), lwd=2) abline(lm(y$stats[4, ] ~ x), col="red", lwd=2) mtext("a", side=3, line=1, at=0, cex=1.1) y = boxplot(W ~ soi.i, plot=FALSE) boxplot(W ~ soi.i, notch=TRUE, names=xlabs, xlab="September SOI pentile", ylab="Wind speed [m s$^{-1}$]", cex.axis=.7) abline(lm(y$stats[3,] ~ x), lwd=2) abline(lm(y$stats[4,] ~ x), col="red", lwd=2) mtext("b", side=3, line=1, at=0, cex=1.1) @ \vspace{-.75cm} \caption{Lifetime highest intensity by (a) June SST and (b) September SOI.} \label{fig:box:lmisstsoi} \end{figure} The first pentad is the lowest 20 percent of all values. The upper and lower limits of the boxes represent the first and third quartiles of cyclone intensity. The median for each group is represented by the horizontal bar in the middle of each box. Notches on the box sides represent an estimated confidence interval about the median. The full range of the observed intensities in each group is represented by the horizontal bars at the end of the dashed whiskers. In cases where the whiskers would extend more than one and half times the interquartile range, they are truncated and the remaining outlying points are indicated by open circles. The red line is the best-fit line through the upper quartile and the black line is through the medians. The box plot summarizes the distribution of maximum storm intensity by pentiles of the covariate. The graphs show a tendency for the upper quantiles of cyclone intensity values to increase with both SST and SOI. As SST increases so does the intensity of the strongest cyclones. Also as SOI increases (toward more La Ni\~na-like conditions) so does the intensity of the cyclones. Results from your exploratory analysis give you reason to continue your investigation. The next step is to model these data. The box plots provide evidence that a model for the mean will not capture the relationships as the trends are larger for higher quantiles. So instead of linear regression you use quantile regression. \subsection{Quantile regression} The quantile function and the conditional box plots shown above are useful for exploratory analysis. They are adequate for describing and comparing univariate distributions. However, since you are interested in modeling the relationship between a response variable (intensity) and the covariates (SST and SOI) it is necessary to introduce a regression-type model for the quantile function. Quantile regression extends ordinary least squares regression model to conditional quantiles of the response variable. Although you used linear regression on the conditional quantiles in the plots above, this is not the same as quantile regression on the covariates. Quantile regression allows you to examine the relationship without the need to consider discrete levels of the covariate. Ordinary regression model specifies how the mean changes with changes in the covariates while the quantile regression model specifies how the quantile changes with changes in the covariates. Quantile regression relies on empirical quantiles, but uses parameters to assess the relationship between the quantile and the covariates. The quantile regression model with two covariates is given by \begin{equation} \hat \mu(\tau) = \hat \beta_0(\tau) + \hat \beta_1(\tau) x_1 + \hat \beta_2 (\tau) x_2\\ \end{equation} where $\hat \mu(\tau)$ is the predicted conditional quantile of tropical cyclone intensity ($W$) and where the $\hat \beta_i$'s are obtained by minimizing the piecewise linear least absolute deviation function given by \begin{equation} \frac{1-\tau}{n}\sum_{w_i < q_i}|w_i - q_i| + \frac{\tau}{n}\sum_{w_i > q_i}|w_i - q_i|. \end{equation} for a given $\tau$, where $q_i$ is the predicted $\tau$ quantile corresponding to observation $i$ ($\hat \mu_i(\tau)$). The value of a simple trend analysis (involving only one variable---usually time) is limited by the fact that other explanatory variables also might be trending. In the context of hurricane intensity, it is well known that the ENSO cycle can significantly alter the frequency and intensity of hurricane activity on the seasonal time scale. A trend over time in hurricane intensity could simply reflect a change in this cycle. Thus it is important to look at the trend after controlling for this important factor. Here we show the trend as a function of Atlantic SST after controlling for the ENSO cycle. Thus we answer the question of whether the data support the contention that the increasing trend in the intensity of the strongest hurricanes is related to an increase in ocean warmth conditional on ENSO. Here we use the superb {\bf quantreg} package for performing quantile regression developed by Roger Koenker. Load the package and print a BibTeX citation. <>= require(quantreg) x = citation(package="quantreg") toBibtex(x) @ Begin with median regression. Here $\tau$ is set to 0.5 and is specified with the \verb@tau@ argument. The function \verb@rq@ performs the regression. The syntax for the model formula is the same as with \verb@lm@ and \verb@glm@. The output is assigned to the object \verb@qrm@. <>= Year = lmisst.df$Yr W = lmisst.df$WmaxS SOI = lmisoi.df$soi SST = lmisst.df$sst qrm = rq(W ~ Year + SST + SOI, tau=.5) @ Rather than least squares or maximum likelihoods, by default a simplex method is used to fit the regression. It is a variant of the \cite{BarrodaleRoberts1974} approach described in \cite{KoenkerdOrey1987}. If your data set has more than a few thousand observations it is recommended that you change the default by specifying \verb@method="fn"@, which invokes the Frisch-Newton algorithm described in \cite{PortnoyKoenker1997}. You obtain a concise summary of the regression results by typing <>= qrm @ The output shows the estimated coefficients and information about the degrees of freedom. You find that the median lifetime intensity decreases with year (negative trend) and increases with both SST and the SOI. To obtain more details, you type <>= summary(qrm) @ <>= require(xtable) s = summary(qrm) tbl = xtable(s$coefficients, label='tab:coefmedianregression', caption='Coefficients of the median regression model.') print(tbl, math.style.negative=TRUE, caption.placement="top") @ Table~\ref{tab:coefmedianregression} gives the estimated coefficients and confidence intervals (95\%) for these parameters. The confidence intervals are computed by the rank inversion method developed in \cite{Koenker2005}. The confidence interval includes zero for Year and the SOI indicating these terms are not significant in explaining the median per cyclone intensity. However, the SST variable is significant and positive. The relationship indicates that for every 1$^\circ$C increase in SST the median intensity increases by \Sexpr{round(qrm$coefficients[3],1)} m~s$^{-1}$. But this seems too large (by an order of magnitude) given the box plot (Fig.~\ref{fig:box:lmisstsoi}) and the range of SST values. <>= range(SST) @ The problem stems from the other variables in the model. To see this, refit the regression model removing the variables that are not significant. <>= qrm2 = rq(W ~ SST, tau=.5) summary(qrm2) @ Now the relationship indicates that for every 1$^\circ$C increase in SST the median intensity increases by \Sexpr{round(qrm2$coefficients["SST"],2)} m~s$^{-1}$, but this amount is not statistically significant as you might have guessed from your exploratory plot. The theory of maximum potential intensity relates a theoretical highest wind speed to ocean temperature so it is interesting to consider quantiles above the median. You repeat the modeling exercise using $\tau$ = 0.9. Here again you find year and SOI not significant so you exclude them in your final model. <>= summary(rq(W ~ SST, tau=.9), se="iid") @ <
>= qrm = rq(W ~ SST, tau=.9) s = summary(qrm, se="iid") tbl = xtable(s$coefficients, label='tab:coefquantreg', caption='Coefficients of the 90th percentile regression model.') print(tbl, math.style.negative=TRUE, caption.placement="top") @ Here instead of the rank-inversion CI you obtain a more conventional table of coefficients (Table~\ref{tab:coefquantreg}) that includes standard errors, $t$-statistics, and $p$-values using the \verb@se="iid"@ argument in the \verb@summary@ function. As anticipated from theory and your exploratory data analysis, you see a statistically significant positive relationship between cyclone intensity and SST for the set of tropical cyclones within the top 10\% of intensities. The estimated coefficient indicates that for every 1$^\circ$C increase in SST the upper percentile intensity increases by \Sexpr{round(qrm$coefficients[2],1)} m~s$^{-1}$. <>= iid = summary(rq(W ~ SST, tau=.9), se="iid") boot = summary(rq(W ~ SST, tau=.9), se="boot") percinc = (boot$coef[2, 2] - iid$coef[2, 2])/iid$coef[2, 2] * 100 @ Other options exist for computing standard errors including a bootstrap approach (\verb@se="boot"@; see \cite{Koenker2005}), which produces a standard error in this case of \Sexpr{round(boot$coef[2,2],2)} (difference of \Sexpr{round(percinc)}\%). The larger standard error results in a significance level that is somewhat less, but the results still provide conclusive evidence of a climate change signal. To visualize the intensity-SST relationship in more detail you plot several quantile regression lines on a scatter plot. For reference you include the least-squares regression line. The code is given below and the results are shown in Fig.~\ref{fig:scatter:quantreg}. Note that you use \verb@type="n"@ in the plot function and use the \verb@points@ function to add the points last so the lines do not cover them. \begin{figure} \centering <>= par(las=1, mgp=c(2, .4, 0), tcl=-.3) plot(SST, W, type="n", xlab="SST [$^\\circ$C]", ylab="Wind speed [m s$^{-1}$]") abline(lm(W ~ SST), col="red", lwd=2) taus = c(.1, .25, .5, .75, .9) for(i in 1:length(taus)){ abline(rq(W ~ SST, tau=taus[i]), col="gray", lwd=2) } points(SST, W, pch=19, cex=.4) @ \vspace{-.75cm} \caption{Quantile regressions of lifetime maximum intensity on SST.} \label{fig:scatter:quantreg} \end{figure} The 0.1, 0.25, 0.5, 0.75, and 0.9 quantile regression lines are shown in gray and the least squares regression line about the mean is shown in red. Trend lines are close to horizontal for the weaker tropical cyclones but have a significant upward slope for the stronger cyclones. To see all the distinct quantile regressions for a particular model you specify a \verb@tau=-1@. For example, save the quantile regressions of wind speed on SST and SOI in an object by typing <>= model = rq(W ~ SST + SOI, tau=-1) @ This will cause the \verb@rq@ function to find the entire sample path of the quantile process. The returned object is of class {\tt rq.process}. You plot the regression coefficients for each variable in the model as a function of quantile by typing <>= plot.rq.process(model) @ \begin{figure} \centering <>= model = rq(W ~ SST + SOI, tau=seq(.025, .975, .05)) labs = round(quantile(W, seq(.2, .8, .2))) par(las=1, mgp=c(2, .4, 0), tcl=-.3) plot(summary(model, se="boot"), parm=2, lcol="transparent", xaxt="n", mar=c(5, 5, 4, 2) + .1, pch=16, lwd=2, ylab="$\\beta_{SST}$ [m s$^{-1}$/$^\\circ$C]", xlab="Wind speed quantile [$\\tau$]", main="") grid() abline(h=0) axis(3, at=seq(.2, .8, .2), labels=labs) mtext("Lifetime highest wind speed [m s$^{-1}$]", side=3, line=2) @ \vspace{-.75cm} \caption{SST coefficient from a regression of LMI on SST and SOI.} \label{fig:quantreg:SST} \end{figure} The result for the SST variable is plotted in Fig.~\ref{fig:quantreg:SST}. Values of $\tau$ range from 0.025 to 0.975 in intervals of 0.05. The 95\% confidence band (gray) is based on a bootstrap method. The plot shows the rising trend of the most intense hurricanes as the ocean temperatures rise after accounting for the El N\~no. The trends depart from zero for quantiles above about 0.4 and become significant for cyclones exceed about 50 m~s$^{-1}$. Additional capabilities for quantile modeling and inference are available in the {\bf quantreg} package. Next we consider a model for the most intense hurricanes. \section{Fastest Hurricane Winds} Eighty percent of all hurricane damage is caused by fewer than 20\% of the worst events \citep{JaggerEtAl2008}. The rarity of severe hurricanes implies that empirical models that estimate the probability of the next big one will be unreliable. However, extreme value theory provides a framework for statistically modeling these rare wind events. Here you employ these models on hurricane wind speeds. This is particularly important since you saw in the previous section that the strongest are getting stronger. First some exploratory analysis. \subsection{Exploratory analysis} A histogram is a good place to start. Here you plot the lifetime maximum wind speeds for all North Atlantic tropical cyclones from the period 1967--2010 as a histogram. You use the same data set as in \S\ref{sec:lifetimehighestintensity}, where \verb@W@ is the vector of wind speeds. <>= W = LMI.df$WmaxS hist(W, main="", las=1, col="gray", border="white", xlab="Wind Speed (m/s)") rug(W) @ \begin{figure} \centering <>= W = LMI.df$WmaxS par(las=1, mgp=c(2, .4, 0), tcl=-.3) hist(W, main="", las=1, col="gray", border="white", xlab="Wind speed [m s$^{-1}$]") rug(W) @ \vspace{-.75cm} \caption{Histogram of lifetime highest intensity.} \label{fig:hist:Wmax} \end{figure} The function uses 5 m~s$^{-1}$ intervals by default although the minimum intensity is \Sexpr{round(min(W),1)} m~s$^{-1}$. Figure~\ref{fig:hist:Wmax} shows a peak in the distribution between 20 and 40 m~s$^{-1}$ and a long right tail. Values in the tail are of interest. For a model of the fastest winds you want to include enough of these high values that your parameter estimates are reliable (they don't change by much if you add or remove a few values). But you also want to be careful not to include too many to ensure that you exclude values representing the more typical cyclones. \subsection{Return periods} Your interest is the return period for the strongest cyclones. The return period is the average recurrence interval, where the recurrence interval is the time between successive cyclones of a given intensity or stronger (events). Suppose you define an event as a hurricane with a threshold intensity of 75 m~s$^{-1}$, then the annual return period is the inverse of the probability that such an event will be exceeded in any one year. Here `exceeded' refers to a hurricane with intensity of at least 75 m~s$^{-1}$. For instance, a 10-year hurricane event has a 1/10 = 0.1 or 10\% chance of having an intensity exceeding a threshold level in any one year and a 50-year hurricane event has a 0.02 or 2\% chance of having an intensity exceeding a higher threshold level in any one year. These are statistical statements. On average, a 10-year event will occur once every 10 years. The interpretation requires that for a year or set of years in which the event does not occur, the expected time until it occurs next remains 10 years, with the 10-year return period resetting each year. Note there is a monotonic relationship between the intensity of the hurricane event (return level) and the return period. The return period for a 75 m~s$^{-1}$ return level must be longer than the return period for a 70 m~s$^{-1}$ return level. The empirical relationship is expressed as \begin{equation} \hbox{RP}=\frac{n+1}{m} \end{equation} where $n$ is the number of years in the record and $m$ is the intensity rank of the event.\footnote{Sometimes $n/(m-0.5)$ is used instead.} You use this formula to estimate return periods for your set of hurricanes. First assign the record length and sort the lifetime maximum wind speeds in decreasing order. Then list the speeds of the six most intense hurricanes. <>= n = length(1967:2010) Ws = sort(W, decreasing=TRUE) round(Ws, 1)[1:6] @ Finally, compute the return period for these six events using the above formula rounding to the nearest year. <>= m = rev(rank(Ws)) round((n + 1)/m, 0)[1:6] @ Thus a \Sexpr{round(Ws,1)[1]} m~s$^{-1}$ hurricane has a return period of \Sexpr{round((n+1)/m,0)[1]} years and a \Sexpr{round(Ws,1)[6]} m~s$^{-1}$ hurricane has a return period of \Sexpr{round((n+1)/m,0)[6]} years. Said another way, you can expect a hurricane of at least \Sexpr{round(Ws,1)[3]} m~s$^{-1}$ once every \Sexpr{round((n+1)/m,0)[3]} years. The threshold wind speed for a given return period is called the return level. Your goal here is a statistical model that provides a continuous estimate of the return level (threshold intensity) for a set of return periods. A model is more useful than a set of empirical estimates because it provides a smoothed return level estimate for all return periods and it allows you to estimate the return level for a return period longer than your data record. The literature provides some examples. \cite{RuppLander1996} use the method of moments on annual peak winds over Guam to determine the parameters of an extreme value distribution leading to estimates of return periods for extreme typhoon winds. \cite{HeckertEtAl1998} use the peaks-over-threshold method and a reverse Weibull distribution to obtain return periods for extreme wind speeds at locations along the U.S. coastline. \cite{Walshaw2000} use a Bayesian approach to model extreme winds jointly from tropical and non-tropical systems. \cite{JaggerElsner2006} use a maximum-likelihood and Bayesian approach to model tropical cyclone winds in the vicinity of the United States conditional on climate factors. In the former study, the Bayesian approach allows them to take advantage of information from nearby sites, in the later study it allows them to take advantage of older, less reliable, data. Here you use functions in the {\bf ismev} package \citep{ColesStephenson2011} to fit an extreme value model for hurricane winds using the method of maximum likelihoods. We begin with some background material. An excellent introduction is provided in \cite{Coles2001}. \subsection{Extreme value theory} Extreme value theory is a branch of statistics. It concerns techniques and models for describing the rare event rather than the typical, or average, event. It is similar to the central limit theory. Both consider the limiting distributions of independent identically distributed (iid) random variables under an affine transformation.\footnote{Linear transformation followed by a translation.} According to the central limit theorem, the mean value of an iid random variable $x$ converges to a normal distribution with mean 0 and variance 1 under the affine transformation $(\bar x - \mu)/\sqrt{n\sigma^2})$, where $\mu$ and $\sigma$ are the mean and standard deviation of $x$, respectively. Similarly, if the distribution of the maxima under some affine transformation converges, then it must converge to a member of the generalized extreme value (GEV) family \citep{EmbrechtsEtAl1997}. The maxima of most continuous random variables converge to a non-degenerate random variable. This asymptotic argument is used to motivate the use of extreme value models in the absence of empirical or physical evidence for assigning an extreme level to a process. However, the argument does not hold for the maxima of discrete random variables including the Poisson and negative binomial. Although by definition extreme values are scarce, an extreme value model allows you to estimate return periods for hurricanes that are stronger than the strongest one in your data set. In fact, your goal is to quantify the statistical behavior of hurricane activity extrapolated to unusually high levels. Extreme value theory provides models for extrapolation. Given a set of observations from a continuous process, if you generate a sample from the set, take the maximum value from the sample, and repeat the procedure many times, you obtain a distribution that is different from that of the original (parent) distribution. For instance, if the original distribution is described by a normal, the distribution of the maxima is quantified by a Gumbel distribution. To see this plot a density curve for the standard normal distribution and compare it to the density curve of the maxima from samples of size 100 taken from the same distribution. Here you generate 1000 samples saving the maxima in the vector \verb@m@. <>= par(mfrow=c(1, 2)) curve(dnorm(x), from=-4, to=4, ylab="Density") m = numeric() for(i in 1:1000) m[i] = max(rnorm(100)) plot(density(m), xlab="Maxima of x", main="") @ \begin{figure} \centering <>= par(mfrow=c(1, 2), las=1, mgp=c(2, .4, 0), tcl=-.3) curve(dnorm(x), from=-4, to=4, ylab="Density", lwd=2) mtext("a", side=3, line=1, adj=0, cex=1.1) m = numeric() for(i in 1:1000) m[i] = max(rnorm(100)) plot(density(m), xlab="Maxima of x", main="", lwd=2) mtext("b", side=3, line=1, adj=0, cex=1.1) @ \vspace{-.75cm} \caption{Density curves. (a) Standard normal and (b) maxima from samples of the standard normal.} \label{fig:samplemax} \end{figure} The results are shown in Fig.~\ref{fig:samplemax}. The maxima belong to a GEV distribution that is shifted relative to the parent distribution and positively skewed. The three parameters of the GEV distributions are determined by the values in the tail portion of the parent distribution. \subsection{Generalized Pareto distribution} A GEV distribution fits the set of values consisting of the single strongest hurricane each year. Alternatively, consider the set of per-cyclone lifetime strongest winds in which you keep all values exceeding a given threshold level, say 60 m~s$^{-1}$. Some years will contribute no values to your set and some years will contribute two or more values. A two-parameter generalized Pareto distribution (GPD) family describes this set of fast winds. The threshold choice is a compromise between retaining enough tropical cyclones to estimate the distribution parameters with sufficient precision, but not too many that the intensities fail to be described by a GPD. Specifically, given a threshold wind speed $u$ you model the exceedances, $W-u$, as samples from a GPD family so that for an individual hurricane with maximum winds $W$, the probability that $W$ exceeds any value $v$ given that it is above the threshold $u$ is given by \begin{eqnarray} p(W > v | W > u)&=&\left\{\begin{array}{cc} \exp([-(v-u)]/\sigma) & \mbox{ when } \xi = 0 \vspace{5 pt}\\ ( 1+\frac{\xi}{\sigma} [v-u])^{-1/\xi} & \mbox{otherwise} \\ \end{array} \right. \end{eqnarray} where $\sigma > 0$ and $\sigma + \xi (v-u) \ge 0$. The parameters $\sigma$ and $\xi$ are scale and shape parameters of the GPD, respectively. Thus you can write $p(W > v | W > u) = $ GPD$(v-u |,\sigma, \xi)$. To illustrate, copy the following code to create a function \verb@sGpd@ for the exceedance probability of a GPD. <>= sGpd = function(w, u, sigma, xi){ d = (w - u) * (w > u) sapply(xi, function(xi) if(xi==0) exp(-d/sigma) else ifelse(1 + xi/sigma * d < 0, 0, (1 + xi/sigma * d)^(-1/xi))) } @ Given a threshold intensity $u$, the function computes the probability that a hurricane at this intensity or higher picked at random will have a maximum wind speed of at least $W$. The probability depends on the scale and shape parameters. For instance, given a scale of 10 and a shape of 0, the probability that a random hurricane will have a maximum wind speed of at least 70 m~s$^{-1}$ is obtained by typing <>= sGpd(w=70, u=60, sigma=10, xi=0) @ The scale parameter controls how fast the probability decays for values near the threshold. The decay is faster for smaller values of $\sigma$. The shape parameter controls the length of the tail. For negative values of $\xi$ the probability is zero beyond a certain intensity. With $\xi = 0$ the probability decay is exponential. For positive values of $\xi$ the tail is described as `heavy' or `fat' indicating a decay in the probabilities gentler than logarithmic. Figure~\ref{fig:gpd} compares exceedance curves for different values of $\sigma$ with $\xi=0$ and for different values of $\xi$ with $\sigma=10$ keeping the threshold value at 60~m~s$^{-1}$. \begin{figure} \centering <>= par(mfrow=c(1, 2), las=1, mgp=c(2, .4, 0), tcl=-.3) curve(sGpd(x, u=60, sigma=5, xi=0), from=60, to=100, lwd=2, col="green", xlab="Wind speed [m s$^{-1}$]", ylab="Exceedance probability") curve(sGpd(x, u=60, sigma=10, xi=0), from=60, to=100, add=TRUE, lwd=2) curve(sGpd(x, u=60, sigma=15, xi=0), from=60, to=100, add=TRUE, lwd=2, col="red") legend("topright", legend=c("$\\sigma$ = 5", "$\\sigma$ = 10", "$\\sigma$ =15"), lwd=2, col=c(3, 1, 2), cex=.8) mtext("a", side=3, line=1, adj=0, cex=1.1) curve(sGpd(x, u=60, sigma=10, xi=-.5), from=60, to=100, lwd=2, col="green", xlab="Wind speed [m s$^{-1}$]", ylab="Exceedance probability") curve(sGpd(x, u=60, sigma=10, xi=0), from=60, to=100, add=TRUE, lwd=2) curve(sGpd(x, u=60, sigma=10, xi=.5), from=60, to=100, add=TRUE, lwd=2, col="red") legend("topright", legend=c("$\\xi$ = -0.5", "$\\xi$ = 0", "$\\xi$ = +0.5"), lwd=2, col=c(3, 1, 2), cex=.8) mtext("b", side=3, line=1, adj=0, cex=1.1) @ \vspace{-.75cm} \caption{Exceedance curves for the generalized Pareto distribution. (a) Different $\sigma$'s with $\xi=0$ and (b) different $\xi$'s with $\sigma=10$.} \label{fig:gpd} \end{figure} \subsection{Extreme intensity model} Given your set of lifetime maximum wind speeds in the object \verb@W@, you use the \verb@gpd.fit@ function from the {\bf ismev} package to find the scale and shape parameters of the GPD using the method of maximum likelihood. Here you set the threshold wind speed to 62 m~s$^{-1}$ as a compromise between high enough to capture only the strongest cyclones but low enough to have a sufficient number of wind speeds. The output is saved as a list object and printed to your screen. <>= require(ismev) model = gpd.fit(W, threshold=62) @ This is a probability model that specifies the chance of a random hurricane obtaining any intensity value given that it has already reached the threshold. The function prints the threshold value, the number of extreme winds in the data set (\verb@nexc@) as defined by the threshold, the negative log-likelihood value (\verb@nllh@), the maximum likelihood parameter estimates (\verb@mle@) and the rate, which is the number of extreme winds divided by the total number of cyclones (per cyclone rate). You use your \verb@sGpd@ function to compute probabilities for a sequence of winds from the threshold value to 85 m~s$^{-1}$ in increments of 0.1 m~s$^{-1}$. <>= v = seq(63, 85, .1) p = sGpd(v, u=62, sigma=model$mle[1], xi=model$mle[2]) @ You then use the plot method to graph them. <>= plot(v, p, type="l", lwd=2, xlab="Wind Speed (m/s)", ylab="p(W > v | W > 62)") @ To turn the per cyclone rate into an annual rate you divide the number of extreme winds by the record length. <>= rate = model$nexc/length(1967:2010) rate @ Thus the annual rate of hurricanes at this intensity or higher over the \Sexpr{length(1967:2010)} years in the data set is \Sexpr{round(rate,2)} cyclones per year. Recall from the Poisson distribution that this implies a <>= round((1 - ppois(0, rate)) * 100, 2) @ percent chance that next year a hurricane will exceed this the threshold. \subsection{Intensity and frequency model} The GPD describes hurricane intensities above a threshold wind speed. You know from Chapter~\ref{chap:frequencymodels} that the Poisson distribution describes hurricane frequency. You need to combine these two descriptions. Let the annual number of hurricanes whose lifetime maximum intensity exceeds $u$ have a Poisson distribution with mean rate $\lambda_u$. Then the average number of hurricanes with winds exceeding $v$ (where $v \ge u$) is given by \begin{equation} \lambda_v = \lambda_u \times p(W > v | W > u) \end{equation} This allows you to model hurricane occurrence separate from hurricane intensification. This is helpful because processes that govern hurricane formation are not necessarily the same as the processes that govern intensification. Moreover from a practical perspective, rather than a return rate per hurricane occurrence, the above specification allows you to obtain an annual return rate on the extreme winds. This is more meaningful for the business of insurance. Now, the probability that the highest lifetime maximum intensity in a given year will be less than $v$ is \begin{eqnarray} p(W_{\hbox{max}} \le v) &=& \exp(-\lambda_v) \\ &=& \exp[-\lambda_u \times \hbox{GPD}(v-u | \sigma, \xi)] \end{eqnarray} The return period RP is the inverse of the probability that $W_{\hbox{max}}$ exceeds $v$, where $v$ is called the return level. You compute the return period and create a return period plot using <>= rp = 1/(1 - exp(-rate * p)) plot(rp, v, type="l", lwd=2, log="x", xlab="Return Period (yr)", ylab="Return Level (m/s)") @ Figure~\ref{fig:returnlevelbasin} shows the results. Return levels increase with increasing return period. The model estimates an \Sexpr{round(v[185])}~m~s$^{-1}$ hurricane will occur on average once every \Sexpr{round(rp[185])} years and an \Sexpr{round(v[length(v)])}~m~s$^{-1}$ hurricane will occur on average once every \Sexpr{round(rp[length(v)])} years. However, based on the results and discussion in \S\ref{sec:lifetimehighestintensity}, these return periods might be getting shorter. \begin{figure} \centering <>= par(las=1, mgp=c(2, .4, 0), tcl=-.3) rp = 1/(1 - exp(-rate * p)) v = seq(63, 85, .1) plot(rp, v, type="l", lwd=2, log="x", xlab="Return period [yr]", ylab="Return level [m s$^{-1}$]") grid(equilogs=FALSE) @ \vspace{-.75cm} \caption{Return periods for the fastest winds.} \label{fig:returnlevelbasin} \end{figure} \subsection{Confidence intervals} <>= v = 73 p = sGpd(v, u=62, sigma=model$mle[1], xi=model$mle[2]) rp = round(1/(1 - exp(-rate * p))) @ You obtain confidence limits on the return period estimates shown in Fig.~\ref{fig:returnlevelbasin} using a bootstrap approach (see Chapter~\ref{chap:classicalstatistics}). Suppose you are interested in the 95\% CI on the return period of a 73 m~s$^{-1}$ hurricane. Your model tells you that the best estimate for the return period is \Sexpr{rp} years. To obtain the CI you randomly sample your set of wind speeds with replacement to create a bootstrap replicate. You run your model on this replicate and get an estimate of the return period. You repeat this procedure 1000 times each time generating a new return period estimate. You then treat the bootstrapped return periods as a distribution and find the lower and upper quantiles corresponding to the 0.025 and 0.975 probabilities. To implement this procedure you type <>= thr = 62 v = 73 rps = numeric() m = 1000 for(i in 1:m){ Wbs = sample(W, size=length(W), replace=TRUE) modelbs = gpd.fit(Wbs, threshold=thr, show=FALSE) ps = sGpd(v, u=thr, sigma=modelbs$mle[1], xi=modelbs$mle[2]) rps[i] = 1/(1 - exp(-rate * ps)) } ci = round(quantile(rps, probs=c(.025, .975))) @ The procedure provides a 95\% CI of (\Sexpr{ci[1]}, \Sexpr{ci[2]}) years about the estimated \Sexpr{rp}-year return period for a 73 m~s$^{-1}$ hurricane. You can estimate other CIs (e.g., 90\%) by specifying different percentiles in the \verb@quantile@ function. \subsection{Threshold intensity} The GPD model requires a threshold intensity $u$. The choice is a tradeoff between an intensity high enough that the positive residual values ($W-u \ge 0$) follow a GPD, but low enough that there are values to accurately estimate the GPD parameters. For an arbitrary intensity level you can compute the average of the positive residuals (excesses). For example, at an intensity of 60~m~s$^{-1}$, the mean excess in units of m~s$^{-1}$ is <>= mean(W[W >= 60] - 60) @ By increasing the level, say to 70~m~s$^{-1}$, the mean excess decreases to \Sexpr{round(mean(W[W>=70]-70),2)}~m~s$^{-1}$. In this way you compute a vector of mean excesses for a range of potential threshold intensities. The relationship between the mean excess and threshold is linear if the residuals follow a GPD. A plot of the mean excess across a range of intensity levels (mean residual life plot) helps you choose a threshold intensity. The function \verb@mrl.plot@ is part of the {\bf ismev} package makes the plot for you. Type <>= mrl.plot(W) grid() @ The result is shown in Fig.~\ref{fig:mrlplot}. The 95\% confidence band is shown in gray. There is a general decrease of the mean excess with increasing intensity levels. The decrease is linear above a value of about 62~m~s$^{-1}$ indicating that any threshold chosen above this intensity results in a set of wind speeds that follow a GPD. To maximize the number of wind speed values for estimating the model parameters, the lowest such threshold is the optimal choice. <>= mrl.plot2 = function (data, umin=min(data), umax=max(data) - .1, conf=.95, nint=100) { x <- xu <- xl <- numeric(nint) u <- seq(umin, umax, length = nint) for (i in 1:nint) { data <- data[data > u[i]] x[i] <- mean(data - u[i]) sdev <- sqrt(var(data)) n <- length(data) xu[i] <- x[i] + (qnorm((1 + conf)/2) * sdev)/sqrt(n) xl[i] <- x[i] - (qnorm((1 + conf)/2) * sdev)/sqrt(n) } plot(u, x, type = "n", lwd=2, xlab="Intensity level [m s$^{-1}$]", ylab = "Mean excess [m s$^{-1}$]", ylim = c(min(xl[!is.na(xl)]), max(xu[!is.na(xu)]))) xx = c(u[!is.na(xl)], rev(u[!is.na(xu)])) yy = c(xl[!is.na(xl)], rev(xu[!is.na(xu)])) polygon(xx, yy, border=NA, col="gray") lines(u[!is.na(xl)], x[!is.na(xl)], lwd=2) } @ \begin{figure} \centering <>= par(las=1, mgp=c(2, .4, 0), tcl=-.3) mrl.plot2(W) grid() @ \vspace{-.75cm} \caption{Mean excess as a function of threshold intensity.} \label{fig:mrlplot} \end{figure} Alternatively you can proceed by trial-and-error. You calculate the parameters of the GPD for increasing thresholds and choose the minimum threshold at which the parameter values remain nearly fixed. \section{Categorical Wind Speeds by County} Hurricane wind speeds are often given by Saffir-Simpson category. While this is can be useful for presentations, you should avoid using categories for analysis and modeling. However, historical data are sometimes provided only by category. Here you model county-level categorical wind data. The data represent direct and indirect hurricane hits by Saffir-Simpson category and are described and organized in Chapter~\ref{chap:datasets}. The wind speed category and count data are saved in separate binary files. Make them available in your working directory by typing <>= load("catwinds.RData") load("catcounts.RData") @ The list of data frames is stored in the object {\tt winds}. Recall that lists are generic objects and can be of any type. To see the data frame for Cameron County, Texas (the first county in the list where the counties are numbered from 1 to 175 starting with south Texas), type <>= winds[[1]] @ The data frame contains a numerical year variable and a categorical survival variable. The survival variable has three components with the first two indicating the wind speed bounds of the cyclone. The bounds correspond to Saffir-Simpson cyclone categories. The data frame of corresponding hurricane counts is stored in the object {\tt counts}. To see the first ten years of counts from Cameron County, type <>= counts[1:10, 1] @ There were no hurricanes in this part of Texas during the first nine years of the 20th century, but there were two in 1909. The first eight county names are printed by typing <>= colnames(counts)[1:8] @ You use the two-parameter Weibull distribution to model the wind speed category data. The survival function ($S(w)=P(W>w)$) for the Weibull distribution ($W \sim \mathrm{Weib}(b,a)$) is \begin{equation} S(w)=\exp (- \left(\frac{w}{b}\right)^a) \label{eq:weibullsurvival} \end{equation} where $a$ and $b$ are the shape and scale parameters, respectively. The Weibull distribution has the nice property that if $W \sim \mathrm{Weib}(a,b)$, then a linear transformation of $W$ results in a variable whose distribution is also Weibull [i.e. $kW \sim \mathrm{Weib}(kb,a)$]. Similarly, a power transformation results in a variable whose distribution is Weibull [i.e. $W^k \sim \mathrm{Weib}(b^k,a/k)$]. However your data does not contain single wind speed values. Instead for a particular cyclone, the affected county has a lower and upper wind speed bound. This is called censored data. You know the wind speed is at least as strong as the lower bound but as strong or weaker than the upper bound.\footnote{Censored data attaches {\tt.time1} and {\tt.time2} to the bounds, but here the winds are from the same time.} In other words, $W$ lies in an interval $[W_l, W_u]$ and the true wind speed follows a Weibull distribution. So instead of using the logarithm of the density function in the Weibull likelihood, you use the logarithm of the probability distribution function over the interval. \subsection{Marked Poisson process} These data were originally modeled in \cite{JaggerEtAl2001}. This model considered annual winds by keeping only the highest wind event for that year. That is, a county that was hit by multiple hurricanes in a given year, only the strongest wind was used. Here you reframe the analysis using a marked Poisson process meaning the wind events are independent and the number of events follows a Poisson distribution with a rate $\lambda$. The marks are the wind speed interval associated with the event. In this way, all events are included. You assume the marks have a Weibull distribution with shape parameter $a$ and scale parameter $b$. The scale parameter has units of wind speed in m~s$^{-1}$. Note that the mean exceedance wind speed is given by $\mu=b\Gamma(1+1/a)$ as can be seen by integrating the survival function (Eq.~\ref{eq:weibullsurvival}). The probability that the yearly maximum wind is less than or equal to $w$ can be found by determining the probability of not seeing a wind event of this magnitude. Given the rate of events ($\lambda$) and the probability of an event exceeding $w$, the rate of events exceeding $w$ is a thinned Poisson process with rate given by \begin{equation} r(w)=\lambda \exp\left(-(w/b)^a\right) \label{eq:thinnedpoisson} \end{equation} So the probability of observing no events is $\exp(-r(w))$ and thus the probability distribution of the yearly maximum winds is given by \begin{equation} F_{max}(w) = \exp\left(-\lambda \exp(-(w/b)^a)\right). \label{eq:yearlymaxdistribution} \end{equation} The return level ($w$) in years ($n$) associated with the return period is given $1/n=1-F(w)$, the long run proportion of years with events exceeding $w$. Solving for $w$ gives \begin{equation} w = b \left(\log \left(\frac{\lambda}{\log \left(\frac{n}{n-1}\right)} \right)\right)^{\frac{1}{a}} \label{eq:returnlevel} \end{equation} which is approximately \begin{equation} w \approx b \left(\log (\lambda (n-.5) )\right) ^ \frac{1}{a}. \label{eq:returnlevelapprox} \end{equation} \subsection{Return levels} To help with the modeling we packaged the functions Weibull survival (\verb@sWeib@), distribution of maximum winds (\verb@sWeibMax@), and return level (\verb@rlWeibPois@) in the file {\bf CountyWinds.R}. Use the \verb@source@ function to input these functions by typing <>= source("CountyWinds.R") @ To see how these functions work, suppose the annual hurricane rate for a county is $\lambda$ = 0.2 and the Weibull survival parameters are $a$ = 5 and $b$ = 50~m~s$^{-1}$. Then, to estimate the return level associated with a 100-year return period hurricane wind event, you type <>= rlWeibPois(n=100, a=5, b=50, lambda=.2) @ Thus you can expect to see a hurricane wind event of magnitude \Sexpr{round(rlWeibPois(n=100,a=5,b=50,lambda=.2),1)}~m~s$^{-1}$ in the county, on average, once every 100 years. Note that since the event frequency is 1 in 5 years (0.2), the return period in years is given by 1/[1-exp(-0.2)] or <>= round(1/(1 - exp(-.2))) @ Note also that the Weibull distribution has support on the real number line to positive infinity (see Chapter~\ref{chap:classicalstatistics}). This means there will be a non-zero probability of a wind exceeding any magnitude. Physically this is not realistic. You can generate a series of return levels using the \verb@rlWeibPois@ function and the assigned parameters by typing <>= rl = round(rlWeibPois(n=c(5, 10, 20) * 10^(rep(0:2, each=3)), a=5, b=50, lambda=.2), 1) rl @ Thus, on average, the county can expect to see a cyclone of \Sexpr{rl[1,2]}~m~s$^{-1}$ once every 10 years. For a given return period, the return level scales linearly with the scale parameter $b$, but to a power of $1/a$ with the shape parameter. Note that the function returns an \verb@NaN@ (not a number) for the 5-year return level since it is below 33~m~s$^{-1}$. \subsection{Covariates} The return level computation above assumes all years have equal probability of events and equal probability of wind speed exceedances. This is called a climatology model. You might be able to do better by conditioning on environmental factors. You include covariate affects by modeling the transformed parameters $\log \lambda$, $\log b$ and $\log a$ as linear functions of the covariates NAO, SST, SOI and SSN. For a given county, let $L_i$ and $U_i$ be the lower and upper bounds for each observation as given in the Table~\ref{tab:windspeedranges} and $y_j$ be the yearly cyclone count. Further, assume that $[\theta_{\lambda},\theta_b,\theta_a]$ is a vector of model parameters associated with covariate matrices given as $\X_{\lambda}$,$\X_{b}$ and $\X_{a}$ of size $m \times p_{\lambda}$,$n \times p_a$ and $n \times p_b$, respectively. The log likelihood function of the process for a given county with $n$ observations over $m$ years is \begin{eqnarray*} \mathrm{LL}(\theta)&=&\sum_{i=1}^n \log\left(\exp(-(L_i/b_i)^{a_i}) -\exp(-(U_i/b_i)^{a_i})\right) + \\ && \sum_{j=1}^m y_j \log(\lambda_i) -\lambda_i - \log(i!)\\ \log(a_i)&=&\X_{a}[i,]\cdot \theta_a \\ \log(b_i)&=& \X_{mu}[i,] \cdot \theta_{b} \\ \log(\lambda_i) &=& \X_{\lambda}[i,] \cdot \theta_{\lambda} \\ \end{eqnarray*} The log likelihood separates into two parts one for the counts and another for the wind speeds. This allows you to use maximum likelihood estimation (MLE) of the count model parameters separately from the wind speed model parameters. The count model is a generalized linear model and you can use the \verb@glm@ function as you did in Chapter~\ref{chap:frequencymodels}. For the wind speeds, you can build the likelihood function (see \cite{JaggerEtAl2001}) or using a package. The advantage of the latter is greater functionality through the use of plot, summary, and predict methods. This is nice. You can usually find a package to do what you need using familiar methods. If not, you can write an extension to an existing package. If you write an extension send it to the package maintainer so your functionality gets added to future versions. The {\bf gamlss} package together with the {\bf gamlss.dist} package provide extensions to the \verb@glm@ function from the {\bf stats} package and to the \verb@gam@ function from the {\bf gam} package for generalized additive models. The {\bf gamlss.cens} package allows you to fit parametric distributions to censored and interval data created using the \verb@Surv@ function in the {\bf Survival} package for use with the {\bf gamlss.dist} package. With this flexibility you can estimate the parameters of the return level model without the need to writing code for the likelihood or its derivatives. You make the packages available to your working directory by typing <>= require(gamlss) require(gamlss.cens) @ You are interested in estimating return levels at various return periods. You can do this using the MLE of the model parameters along with a set of covariates using \verb@rlWeibPois@ as described above. The estimates have a degree of uncertainty due to finite sample size. The estimated model parameters also have uncertainty. You propagate this uncertainty to your final return level estimates in at least two ways. One is to estimate the variance of the return level as a function of the parameter covariance matrix (delta method). Another is to sample the parameters assuming they have a normal distribution with a mean equal to the MLE estimate and with a variance-covariance matrix given by $\Sigma$, where $\Sigma$ is a block diagonal matrix composed of a $p_\lambda \times p_\lambda$ covariance matrix from the count model and a $p_a + p_b \times p_a + p_b$ covariance matrix from the wind speed model. The parameters and the covariances are returned from the \verb@vcov@ function on the model object returned from \verb@glm@ and \verb@gamlss@. The latter is done as follows. First you generate samples of the transformed parameters and save them in separate vectors ($\log \lambda, \log b, \log a$). Then you take the antilog of the inner product of the parameters and the corresponding set of their predictions based on the covariates. You then pass these values and your desired return periods to \verb@rlWeibPois@ to obtain one return level for each return period of interest. Finally you use the function \verb@sampleParameters@ provided in \verb@CountyWinds.R@ to sample the return levels for a given set of predictors and return periods. \subsection{Miami-Dade} As an example, here you model the categorical wind data for Florida's Miami-Dade County. The model can be applied to any county that has experienced more than a few hurricanes. Since not all counties have the same size, comparing wind probabilities across counties is not straightforward. On the other hand, county-wide return levels are useful to local officials. You will model the county data with and without the covariates. First you extract the wind speed categories and the counts. <>= miami.w = winds[[57]] Year = as.numeric(row.names(counts)) H = counts[, 57] miami.c = data.frame(Year=Year, H=H) @ Since you have two separate data sets it is a good idea to see if the cyclone counts match the winds by year and number. You do this by typing <>= all(do.call("data.frame", rle(miami.w$Year))- miami.c[miami.c$H > 0, c(2, 1)] == 0) @ You first fit the counts to a Poisson distribution by typing <>= fitc = glm(H ~ 1, data=miami.c, family="poisson") @ Next you fit the wind speed intervals to a Weibull distribution by typing <>= WEIic = cens(WEI, type="interval", local=FALSE) fitw = gamlss(W ~ 1, data=miami.w, family=WEIic, trace=FALSE) @ %Bug in vcov.gamlss <>= rm(list=c("W", "Year")) @ Finally you generate samples of return levels for a set of return periods by typing <>= rp = c(5, 10, 20) * 10^(rep(0:2, each=3)) rl = sampleParameters(R=1000, fitc=fitc, fitw=fitw, n=rp) @ You display the results with a series of box plots. <>= boxplot(rl[, , ], xlab="Return Period (yr)", ylab="Return Level (m/s)", main="Miami-Dade") @ The results are shown in Fig.~\ref{fig:returnlevelMiami}, which also includes the same plot using data from Galveston, Texas. The median return level is shown with dot. Given the model and the data a 50-year return period is a hurricane that produces winds of at least 60 m~s$^{-1}$ in the county. The return level increases with increasing return period. The uncertainty levels represent the upper and lower quartile values and the ends of the whiskers define the 95\% confidence interval. \begin{figure} \centering <>= par(mfrow=c(1, 2), las=1, mgp=c(2, .4, 0), tcl=-.3) xvalues = 1:dim(rl)[3] plot(rp, rep(50, dim(rl)[3]), type="n", log="x", ylim=c(20, 80), xlab="Return period [yr]", ylab="Return level [m s$^{-1}$]") grid(equilogs=FALSE) mtext("a", side=3, line=1, adj=0, cex=1.1) for(i in xvalues){ points(rp[i], quantile(rl[, 1, i], prob=.5, na.rm=TRUE), pch=19, col="black") lines(c(rp[i], rp[i]), c(quantile(rl[, 1, i], prob=.25, na.rm=TRUE), quantile(rl[, 1, i], prob=.025, na.rm=TRUE))) lines(c(rp[i], rp[i]), c(quantile(rl[, 1, i], prob=.75, na.rm=TRUE), quantile(rl[, 1, i], prob=.975, na.rm=TRUE))) } galv.w = winds[[13]] Year = as.numeric(row.names(counts)) H.g = counts[, 13] galv.c = data.frame(Year=Year, H.g=H.g) fitc.g = glm(H.g ~ 1, data=galv.c, family="poisson") WEIic.g = cens(WEI, type="interval", local=FALSE) fitw.g = gamlss(W ~ 1, data=galv.w, family=WEIic.g, trace=FALSE) rm(list=c("Year")) rp.g = c(5, 10, 20) * 10^(rep(0:2, each=3)) rl.g = sampleParameters(R=1000, fitc=fitc.g, fitw=fitw.g, n=rp.g) plot(rp.g, rep(50, dim(rl.g)[3]), type="n", log="x", ylim=c(20, 80), xlab="Return period [yr]", ylab="Return level [m s$^{-1}$]") grid(equilogs=FALSE) mtext("b", side=3, line=1, adj=0, cex=1.1) for(i in xvalues){ points(rp.g[i], quantile(rl.g[, 1, i], prob=.5, na.rm=TRUE), pch=19, col="black") lines(c(rp.g[i], rp.g[i]), c(quantile(rl.g[, 1, i], prob=.25, na.rm=TRUE), quantile(rl.g[, 1, i], prob=.025, na.rm=TRUE))) lines(c(rp.g[i], rp.g[i]), c(quantile(rl.g[, 1, i], prob=.75, na.rm=TRUE), quantile(rl.g[, 1, i], prob=.975, na.rm=TRUE))) } @ \vspace{-.75cm} \caption{Return periods for winds in (a) Miami-Dade and (b) Galveston counties.} \label{fig:returnlevelMiami} \end{figure} Andrew struck Miami in 1992 as a category 5 hurricane with 70 m~s$^{-1}$ winds. Your model indicates that the most likely return period for a cyclone of this magnitude is 1000 years, but it could be as short as 100 years. Return levels are higher at all return periods for Miami compared to Galveston. Miami is closer to the main tropical cyclone development region of the North Atlantic. This chapter showed how to create models from cyclone intensity data. We began by considering the set of lifetime maximum wind speeds for basin-wide cyclones and a quantile regression model for trends. We then showed how to model the fastest winds using models from extreme-value theory. The models estimate the return period of winds exceeding threshold intensities. We finished with a model for interval wind data that describe the hurricane experience at the county level. We demonstrated the model on data from Miami-Dade County. A categorical wind speed model can be used on tornado data where intensities are estimated in intervals defined by the Fujita scale.