Count Variability
=================

"A big computer, a complex algorithm and a long time does not equal science."---Robert Gentleman

```{r loadData}
load("annual.RData")
mean(annual$B.1[annual$Year > 1966])
```

Large variation in small-count processes is sometimes misdiagnosed as physically meaningful. As an example, consider hurricane counts over a sequence of N years with a constant annual Poisson rate lambda. What is the probability that you will find at least M of these years with a count less than X (described as an inactive season) or a count greater than Y (described as an active season)? Here are the steps using R notation.

* In a given year, the probability of the count h being less than X or greater than Y is PXY = 1 - ppois(Y, lambda) + ppois(X - 1, lambda). In other words, it is one minus the probability that h lies between X and Y, inclusive.
* Assign an indicator I = 1 for each year with h < X or h > Y.
* Then the sum of I has a binomial distribution with size N and probability PXY.
* The probability of observing at least M of these years is then given as PM = 1 - pbinom(M - 1, N, PXY).

You create the PM() function to perform these computations.

```{r extremeYears}
PM = function(X, Y, lambda, N, M){
  PXY = 1 - diff(ppois(c(X - 1, Y), lambda))
  return(1 - pbinom(M - 1, N, PXY))
}
```

Arguments for ppois() are q (quantile) and lambda (rate) and the arguments for pbinom() are q, size, and prob. You use your function to answer the following question. Given an annual rate of 6 hurricanes per year (lambda), what is the probability that in a random sequence of 10 years (N) you will find at least two years (M) with a hurricane count less than 3 (X) or greater than 9 (Y)?

```{r probabilityLargeVariability}
PM(X=3, Y=9, lambda=6, N=10, M=2)
```

Thus you find a `r I(round(PM(X=3,Y=9,lambda=6,N=10,M=2),2)*100)`% chance of having two years with large departures from the mean rate. The function is handy. It protects you against getting fooled by randomness. Indeed, the probability that at least one year in ten falls outside the range of +/-2 standard deviations from the mean is `r I(round(PM(X=3,Y=9,lambda=6,N=10,M=1),1)*100)`%. This compares with 37% for a set of variables described by a normal distribution and underscores the limitation of using ideas derived from continuous distributions on count data.

On the other hand, if you consider the annual global tropical cyclone counts over the period 1981--2006 (from Elsner et al. 2008) you find a mean of 80.7 tropical cyclones per year with a range between 66 and 95. Assuming the counts are Poisson, you use your function to determine the probability that no years have fewer than 66 or more than 95 tropical cyclones in the 26-year sample.

```{r gtcaDataNaturePaper}
gtca = c(66, 78, 72, 89, 89, 78, 78, 73, 79, 81, 70, 93, 73,
         92, 78, 95, 88, 78, 72, 84, 84, 79, 81, 84, 90, 74)
1 - PM(X=66, Y=95, lambda=80.7, N=26, M=1)
```

This low probability provides suggestive evidence that the physical processes governing global hurricane activity are more regular than Poisson. The regularity could be due to feedbacks in the climate system. For example, the cumulative effect of many hurricanes over a particular basin might make the atmosphere less conducive for activity in other basins. Or it might be related to a governing mechanism like the North Atlantic Oscillation (Elsner and Kocher 2000).
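As a quick check on this calculation (a minimal simulation sketch, not part of the original analysis), you can draw Poisson samples with the same rate and sample size and compute the fraction of samples in which every year stays within the observed range of 66 to 95.

```{r checkPMbySimulation}
# Illustrative Monte Carlo check of the result above: simulate 26-year
# Poisson samples with rate 80.7 and count how often all years fall in [66, 95].
set.seed(3042)
inRange = replicate(10000, {
  h = rpois(26, lambda = 80.7)
  all(h >= 66 & h <= 95)
})
mean(inRange)  # should be close to 1 - PM(X=66, Y=95, lambda=80.7, N=26, M=1)
```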
### Moving average

A moving average removes year-to-year fluctuation in counts. The assumption is that of a smoothly varying rate process.

You use the filter() function to compute running means. The first argument in the function is a univariate or multivariate time series and the second is the filter as a vector of coefficients in reverse time order. For a moving average of length $N$ the coefficients all have the value $1/N$. For example, to compute the 5-year running average of the basin-wide hurricane counts, type

```{r maFilter}
ma = filter(dat$B.1, rep(1, 5)/5)
str(ma, strict.width="cut")
```

The output is an object of class ts (time series). Values at the ends of the time series are not filtered so NA's are used there. For a filter of odd length, the number of values missing at the start of the filtered series matches the number missing at the end.

Here you create a new function called moveavg() and use it to compute the moving averages of basin counts over 5, 11, and 21 years.

```{r maFunction}
moveavg = function(X, N){filter(X, rep(1, N)/N)}
h.5 = moveavg(dat$B.1, 5)
h.11 = moveavg(dat$B.1, 11)
h.21 = moveavg(dat$B.1, 21)
```

Plot the moving averages on top of the observed counts.

```{r hurricaneMovingAverage}
cls = c("grey", "red", "blue", "green")
lg = c("Count", "5-yr rate", "11-yr rate", "21-yr rate")
plot(dat$Yr, dat$B.1, xlab="Year", ylab="Hurricane count/rate",
     col="grey", type="h", lab=c(10, 7, 20), lwd=4)
grid()
points(dat$Yr, dat$B.1, type="h", lwd=4, col="grey")
lines(dat$Yr, h.5, col="red", lwd=2)
lines(dat$Yr, h.11, col="blue", lwd=2)
lines(dat$Yr, h.21, col="green", lwd=2)
legend("topleft", lty=1, lwd=c(4, 2, 2, 2), col=cls, legend=lg,
       bg="white", cex=.65)
```

Notice the reduction in the year-to-year variability as the length of the moving average increases. Note also that the low-frequency variation is not affected; a moving average is a low-pass filter. You can check this by comparing the means (the mean is the zero-frequency component) of the moving averages, as in the sketch below.
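A minimal version of that check (illustrative code, not from the original text): the means of the smoothed series should be close to the raw mean, differing only because the NA's at the series ends are dropped.

```{r compareMeans}
# Compare the raw mean with the means of the moving averages.
mean(dat$B.1)
mean(h.5, na.rm=TRUE)
mean(h.11, na.rm=TRUE)
mean(h.21, na.rm=TRUE)
```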
### Seasonality

One of the most obvious climatological characteristics of hurricanes is seasonality. Very few hurricanes occur before July 15th, September is the most active month, and the season is typically over by November. In general, the ocean is too cool and the wind shear too strong during the months of January through May and from November through December. Seasonality is evident in plots showing the historical number of hurricanes that have occurred on each day of the year. Here you model this seasonality to produce a probability of hurricane occurrence as a function of the day of year.

You use the hourly interpolated best track data (best.use.RData). The data span the years 1851--2010. Import the data frame and subset on hurricane-force wind speeds.

```{r getBestUseDataAndSubset}
load("best.use.RData")
H.df = subset(best.use, WmaxS >= 64)
head(H.df)
```

Next, create a factor variable from the day-of-year column (jd). The day of year starts on the first of January. You use only the integer portion as the rows correspond to separate hours.

```{r dayOfYearAsFactor}
jdf = factor(trunc(H.df$jd), levels=1:365)
```

The vector contains the day of year (1 through 365) for all `r I(length(jdf))` hurricane hours in the data set. You could use 366, but there are no hurricanes on December 31st during any leap year over the period of record.

Next, use the table() function on the vector to obtain the total hurricane hours by day of year and create a count of hurricane days by dividing the number of hours by 24 and rounding to the nearest integer.

```{r tableHurrHours}
Hhrs = as.numeric(table(jdf))
Hd = round(Hhrs/24, 0)
```

The vector Hd contains the number of hurricane days over the period of record for each day of the year. A plot of the raw counts shows the variation from day to day is rather large. Here you create a model that smooths these variations. This is done with the gamlss() function from the **gamlss** package (Rigby and Stasinopoulos 2005). You model your counts using a Poisson distribution with the logarithmic link as a function of day of year.

```{r fitGamlss, message=FALSE}
require(gamlss)
julian = 1:365
sm = gamlss(Hd ~ pb(julian), family=PO, trace=FALSE)
```

Here you use a non-parametric smoother on the Julian day. The smoother is a penalized B-spline (Eilers and Marx 1996) and is indicated by pb() in the model formula. The penalized B-spline is an extension of the Poisson regression model that conserves the mean and variance of the daily hurricane counts and has a polynomial curve as the limit. The Poisson distribution is specified in the family argument with PO.

Although there are a few days with hurricanes outside the main season, your interest centers on the months of June through November. Here you create a sequence of Julian days defining the hurricane season and convert them to dates.

```{r hurricaneSeasonDates}
hs = 150:350
doy = as.Date("1970-12-31") + hs
```

You then convert the hurricane days to a relative frequency to allow for a probabilistic interpretation. This is done for both the actual counts and the smoothed modeled counts.

```{r convertToRelativeFrequency}
ny = (2010 - 1851) + 1
Hdm = Hd[hs]/ny
smf = fitted(sm)[hs]/ny
```

Finally you plot the modeled and actual daily frequencies by typing

```{r plotActualvsModeled}
plot(doy, Hdm, pch=16, xlab="", ylab="Frequency (days/yr)")
lines(doy, smf, lwd=2, col="red")
```

Circles show the relative frequency of hurricanes by day of year. The red line shows the fitted values of a model for the frequencies. Horizontal tic marks indicate the first day of the month. On average hurricane activity increases slowly until the beginning of August as the ocean warms and wind shear subsides. The increase is more pronounced starting in early August and peaks around the first or second week in September. The decline starting in mid September is somewhat less pronounced than the increase and is associated with ocean cooling. There is a minor secondary peak during the middle of October related to hurricane genesis over the western Caribbean Sea. The climate processes that make this part of the basin relatively active during this time of the year are likely somewhat different from the processes occurring during the peak of the season.
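As a small illustration (not part of the original text), you can extract the day at which the fitted seasonal curve peaks and convert it to a calendar date; it should fall in early-to-mid September.

```{r peakOfSeason}
# Julian day at which the fitted curve is a maximum, converted to a date.
peak = julian[which.max(fitted(sm))]
as.Date("1970-12-31") + peak
```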
Change Points
-------------

Hurricane activity can change abruptly, going from active to inactive in a matter of a year or so. In this case a change-point model is appropriate for describing the time series. Here a change point refers to a jump in the rate of activity from one set of years to the next. The underlying assumption is a discontinuity in the rates. For example, suppose hurricanes suddenly become more frequent in the years 1934 and 1990; then the model would still be Poisson, but with different rates in the periods (epochs) 1900--1933, 1934--1989, and 1990--2010.

### Counts

The simplest approach is to restrict your search to a single change point. For instance, you check to see if a model that has a rate change during a given year is better than a model that does not have a change during that year.

In this case you have two models: one with a change point and one without. To make a choice, you check which model has the lower Schwarz Bayesian Criterion (SBC). The SBC is proportional to $-2\log[p(\hbox{data}|\hbox{model})]$, where $p(\hbox{data}|\hbox{model})$ is the probability of the data given the model. This comparison can be done using the gamlss() function. Make the package available and obtain the SBC value for each of three models by typing

```{r loadGamlssPackageCheckModels, message=FALSE}
require(gamlss, quietly=TRUE)
gamlss(B.1 ~ 1, family=PO, data=dat, trace=FALSE)$sbc
gamlss(B.1 ~ I(Yr >= 1910), family=PO, data=dat, trace=FALSE)$sbc
gamlss(B.1 ~ I(Yr >= 1940), family=PO, data=dat, trace=FALSE)$sbc
```

Here the Poisson family is given as PO with the logarithm of the rate as the default link (Stasinopoulos and Rigby 2007). The first model has no change point. The next two are change-point models, the first with a change point in the year 1910 and the second with a change point in 1940. The change-point models use the indicator function I() to assign a TRUE or FALSE to each year based on a logical expression involving the variable Yr.

The SBC value is `r I(round(gamlss(B.1~1,family=PO,data=dat,trace=F)$sbc,1))` for the model with no change points. This compares with an SBC value of `r I(round(gamlss(B.1~I(Yr>=1910),family=PO,data=dat,trace=F)$sbc,1))` for the change-point model where the change occurs in 1910 and a value of `r I(round(gamlss(B.1~I(Yr>=1940),family=PO,data=dat,trace=F)$sbc,1))` for the change-point model where the change occurs in 1940. Since the SBC is lower in the latter case, 1940 is a candidate year for a change point.

You apply the above procedure successively, with each year considered in turn as a possible change point. You then plot the SBC as a function of year.

```{r sbcChangePoints}
sbc.int = gamlss(B.1 ~ 1, family=PO, data=dat, trace=FALSE)$sbc +
  2 * log(20)
changepoints = 1901:2010
sbc.change = sapply(changepoints, function(x)
  gamlss(B.1 ~ I(Yr >= x), family=PO, data=dat, trace=FALSE)$sbc)
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
plot(changepoints, sbc.change, ylab="SBC value", xlab="Year",
     ylim=c(520, 550), lab=c(10, 7, 20), lwd=2, type="l")
grid()
abline(h=sbc.int, col="grey")
lines(changepoints, sbc.change, lwd=2)
changepoints2 = c(1995, 1948, 1944, 1932)
rug(changepoints2, lwd=2, col="red")
```

The horizontal line is the SBC for a model with no change points and the tick marks are local minima of the SBC. Here the SBC for the model without a change point is adjusted by adding $2\log(20)$ to account for the prior possibility of 5 or 6 equally likely change points over the period of record.

Here you find four candidate change points based on local minima of the SBC. The years are 1995, 1948, 1944, and 1932. You assume a prior that allows only one change point per decade and that the posterior probability of the intercept model is 20 times that of the change-point model. This gives you 12 possible models (1995 only, 1995 & 1948, 1995 & 1948 & 1932, etc.) including the intercept-only model but excluding models with both 1944 and 1948 as the changes occur too close in time.

Next, you estimate the posterior probabilities for each of the 12 models using
$$
\mathrm{Pr}(M_i|\mathrm{data})=\frac{\exp(-.5\cdot\mathrm{SBC}(M_i))}{\sum_{j=1}^{12} \exp(-.5\cdot\mathrm{SBC}(M_j))}
$$
where the models are given by $M_i$, for $i=1, \ldots, 12$.
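As a worked illustration of this formula (using made-up SBC values, not values from the models above), the conversion from SBC to posterior probability subtracts the minimum SBC before exponentiating to avoid numerical underflow; the code in the next chunk does the same thing.

```{r sbcToPosteriorExample}
# Hypothetical SBC values for two competing models (illustrative only).
sbc = c(540.1, 535.2)
w = exp(-0.5 * (sbc - min(sbc)))  # subtracting min(sbc) leaves the ratios unchanged
w / sum(w)                        # posterior probabilities summing to one
```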
The results are shown in the table.

```{r posteriorProbabilities, message=FALSE}
changepointstext = sapply(changepoints2, function(x)
  paste("I(Yr>=", x, ")"))
modelarray = do.call("expand.grid", rep(list(c(FALSE, TRUE)), 4))[-1, ]
modelarray = modelarray[!(modelarray[2] & modelarray[3]), ]
modelformulas = lapply(data.frame(t(modelarray)), function(x)
  formula(paste("B.1~", paste(changepointstext[x], collapse="+",
  sep=""), sep="")))
modelformulas = c(X1=formula("B.1~1"), modelformulas)
modelfits = lapply(modelformulas, function(x)
  do.call("gamlss", list(x, family=quote(PO(mu.link="identity")),
  data=dat, trace=FALSE)))
modelsbc = sapply(modelfits, function(x) x$sbc)
modelspostprob = exp(-.5*(modelsbc - min(modelsbc)))
modelspostprob = modelspostprob/sum(modelspostprob)
modelspostprobround = round(modelspostprob, 3)
modelorder = order(modelsbc)
modelsandprob = data.frame(Formula=gsub(fixed=TRUE, " ", "",
  sapply(modelformulas[modelorder], function(x) deparse(x))),
  Probability=modelspostprobround[modelorder])
bestmodel = modelfits[[which(max(modelspostprob) == modelspostprob)[1]]]
coefs = round(coef(bestmodel), 2)
require(xtable)
tbl = xtable(modelsandprob, label='tab:modelposteriorprobs',
  caption='Model posterior probabilities from most (top) to least probable.')
print(tbl, math.style.negative=TRUE, caption.placement="top")
```

The top three models have a total posterior probability of `r I(round(sum(modelsandprob$Probability[1:3]),2)*100)`%. These models all include 1995, with 1932, 1944, and 1948 competing as the second most important change-point year. You can select any one of these models, but it makes sense to choose one with a relatively high posterior probability. Note the weaker support for the single change-point models and the even weaker support for the model with no change point.

The single best model has change points in 1932 and 1995. The coefficients of this model are shown here.

```{r singleBestModel}
tbl = xtable(glm(bestmodel), label='tab:coefficients',
  caption='Best model coefficients and standard errors.')
print(tbl, math.style.negative=TRUE, caption.placement="top")
```

The model predicts a rate of `r I(round(as.numeric(coefs[1]),1))` hur/yr in the period 1900--1931. The rate jumps to `r I(round(sum(coefs[1:2]),1))` hur/yr in the period 1932--1994 and jumps again to `r I(round(sum(coefs),1))` hur/yr in the period 1995--2010.

### Covariates

To understand what might be causing the abrupt shifts in hurricane activity, here you include known covariates in the model. The idea is that if the shift is no longer significant after adding a covariate, then you conclude that a change in climate is the likely causal mechanism. The two important covariates for annual basin-wide hurricane frequency are SST and the SOI.

You first fit and summarize a model using the two change points and these two covariates.

```{r modelOne}
model1 = gamlss(B.1 ~ I(Yr >= 1932) + I(Yr >= 1995) + sst + soi,
  family=PO, data=dat, trace=FALSE)
summary(model1)
```

You find the change point at 1995 has the largest $p$-value among the variables. You also note that the model has an SBC of `r I(round(model1$sbc,1))`. You consider whether the model can be improved by removing the change point at 1995, so you remove it and refit the model.

```{r modelTwo}
model2 = gamlss(B.1 ~ I(Yr >= 1932) + sst + soi,
  family=PO, data=dat, trace=FALSE)
summary(model2)
```

You find all variables statistically significant ($p$-value less than 0.1) and the model has an SBC of `r I(round(model2$sbc,1))`, which is lower than the SBC of your first model that includes 1995 as a change point.
Thus you conclude that the shift in the rate at 1995 is more likely the result of a synchronization (Tsonis et al. 2006) of the effects of rising SST and ENSO on hurricane activity than is the shift in 1932. The shift in 1932 remains important after including the SST and ENSO influences, providing evidence that the increase in activity at this time is likely due, at least in part, to improvements in observing technologies. For comparison, you also fit a model that includes the covariates but neither change point.

```{r modelThree}
model3 = gamlss(B.1 ~ sst + soi, family=PO, data=dat, trace=FALSE)
summary(model3)
```

A change-point model is useful for detecting rate shifts caused by climate and observational improvements. When used together with climate covariates it can help you differentiate between these two possibilities. However, change-point models are not particularly useful for predicting when the next change will occur.

Continuous Time Series
----------------------

Sea-surface temperature, the SOI, and the NAO are continuous time series. Values fluctuate over a range of scales, often without abrupt changes. In this case it can be useful to split the series into a few components, where each component has a smaller range of scales. Here the goal is to decompose the SST time series as an initial step in creating a time-series model. The model can be used to make predictions of future SST values. Future SST values are subsequently used in your hurricane frequency model to forecast the probability of hurricanes (Elsner et al. 2008).

You return to your monthly SST values over the period 1856--2010. You input the data and create a continuous-valued time series object (sst.ts) containing monthly SST values beginning with January 1856.

```{r readSSTdata}
con = "http://www.hurricaneclimate.com/storage/chapter-10/SST.txt"
SST = read.table(con, header=TRUE)
sst.m = as.matrix(SST[6:160, 2:13])
sst.v = as.vector(t(sst.m))
sst.ts = ts(sst.v, frequency=12, start=c(1856, 1))
```

First you plot your time series by typing

```{r plotSeries}
plot(sst.ts, ylab="SST (C)")
```

The graph shows the time series is dominated by the annual cycle. The ocean is coldest in February and March and warmest in August and September. The average temperature during March is `r I(round(colMeans(SST,na.rm=T)[4],1))`$^\circ$C and during August is `r I(round(colMeans(SST,na.rm=T)[9],1))`$^\circ$C. There also appears to be a trend toward greater warmth, although it is difficult to see because of the much larger annual cycle.

The observed series can be decomposed into a few components. This is done here using the stl() function, which accepts a time series object as its first argument; the type of smoothing window is specified through the s.window argument.

```{r decomposeTimeSeries}
sdts = stl(sst.ts, s.window="periodic")
```

The seasonal component is found by a local regression smoothing of the monthly means. The seasonal values are then subtracted, and the remainder of the series is smoothed to find the trend. The overall time-series mean value is removed from the seasonal component and added to the trend component. The process is iterative. What remains is the difference between the actual monthly values and the sum of the seasonal and trend components.
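As a quick sanity check (an illustrative aside, not in the original text), the three components returned by stl() are stored in the time.series matrix of the fitted object, and they sum back to the original series.

```{r checkDecomposition}
# Seasonal, trend, and remainder components of the decomposition.
head(sdts$time.series)
# The components reconstruct the raw series up to floating-point error.
range(sst.ts - rowSums(sdts$time.series))
```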
If you have change points in your time series you can use the bfast() function from the **bfast** package instead to decompose your time series. In that case the trend component contains the change points.

To plot the raw and component series the data are prepared as follows. First a vector of dates is constructed using the seq.dates() function from the **chron** package. This allows you to display the graphs at points that correspond to real dates.

```{r dateSequence, message=FALSE}
require(chron)
date = seq.dates(from="01/01/1856", to="12/31/2010", by="months")
```

Next a data frame is constructed that contains the vector of dates, the raw monthly SST time series, and the corresponding components from the seasonal decomposition.

```{r dataFrame}
datw = data.frame(Date=as.Date(date),
                  Raw=as.numeric(sst.ts),
                  Seasonal=as.numeric(sdts$time.series[, 1]),
                  Trend=as.numeric(sdts$time.series[, 2]),
                  Residual=as.numeric(sdts$time.series[, 3]))
head(datw)
```

Here the data are in the 'wide' form like a spreadsheet. To make them easier to plot as separate time series graphs you create a 'long' form of the data frame with the melt() function in the **reshape** package. The function melts your data frame into a form suitable for casting (Wickham 2007). You specify the data frame and your Date column as the id variable. The function assumes the remaining variables are measure variables (non-id variables), with the column names turned into a vector of factors.

```{r convertWideToLong, message=FALSE}
require(reshape)
datl = melt(datw, id="Date")
head(datl); tail(datl)
```

Here you make use of the **ggplot2** functions to create a facet grid that displays your time series plots with a common time axis. The argument scale="free_y" allows the y axes to have different scales. This is important as the decomposition results in a large seasonal component centered on zero, while the trend component is smaller but remains on the same scale as the raw data.

```{r timeSeriesDecompositionPlot, message=FALSE}
require(ggplot2)
ggplot(datl, aes(x=Date, y=value)) +
  geom_line() +
  facet_grid(variable ~ ., scale="free_y") +
  ylab("SST [C]") + xlab("") +
  theme_bw()
```

The observed (raw) values are shown in the top panel. The seasonal component, trend component, and residuals are shown in separate panels on the same time-series axis. Temperatures increase by more than 0.5$^\circ$C over the past 100 years. But the trend is not monotonic. The residuals show large year-to-year variation, generally between $-$0.15 and $+$0.15$^\circ$C, with somewhat larger variation before about 1871.

You can build separate time series models for each component. For example, for the residual component ($R_t$) an autoregressive moving average (ARMA) model can be used. An ARMA model with $p$ autoregressive terms and $q$ moving average terms [ARMA($p$, $q$)] is given by
$$
R_t = \sum_{i=1}^p \phi_i R_{t-i} + \sum_{i=1}^q \theta_i \varepsilon_{t-i} + \varepsilon_t
$$
where the $\phi_i$'s and the $\theta_i$'s are the parameters of the autoregressive and moving average terms, respectively, and the $\varepsilon_t$'s are random white noise terms assumed to be described by independent normal distributions with zero mean and variance $\sigma^2$. For the trend component an ARIMA model is more appropriate. An ARIMA model generalizes the ARMA model by removing the non-stationarity through an initial differencing step.
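To get a feel for what an ARMA process looks like (an illustrative sketch with made-up parameter values, not values estimated from the SST data), you can simulate one with the arima.sim() function and examine its autocorrelation.

```{r simulateARMA}
# Simulate 240 months of an ARMA(1, 1) process with arbitrary coefficients.
set.seed(1)
r = arima.sim(model=list(ar=0.6, ma=0.3), n=240, sd=0.05)
acf(r, main="Simulated ARMA(1,1) series")
```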
Here you use the ar() function to determine the order of the autoregressive portion of the trend series using the AIC.

```{r determineARorder}
ar(datw$Trend)
```

The result shows an autoregressive order of 11 months. Continuing, you assume the integrated and moving average orders are both one.

```{r ARIMAmodel}
model = arima(datw$Trend, order=c(11, 1, 1))
```

You then use the model to make monthly forecasts out to 36 months with the predict() method. The number of forecast steps is specified by the n.ahead argument.

```{r ForecastThirty36}
nfcs = 36
fcst = predict(model, n.ahead=nfcs)
```

You plot the forecasts along the corresponding date axis by typing

```{r plotForecasts}
newdate = seq.dates(from="01/01/2011", to="12/01/2013", by="months")
plot(c(datw$Date[1801], newdate[nfcs]),
     c(min(datw$Trend), max(datw$Trend) + .3), type="n",
     ylab="SST [C]", xlab="Date (MM/YY)")
grid()
lines(datw$Date, datw$Trend, lwd=2)
pm = fcst$pred
pl = fcst$pred - 1.96 * fcst$se
pu = fcst$pred + 1.96 * fcst$se
xx = c(newdate, rev(newdate))
yy = c(pl, rev(pu))
polygon(xx, yy, border=NA, col="gray")
lines(newdate, pm, lwd=2, col="red")
```

The observed values are in black and the forecast values are in red. A 95% confidence band is shown in gray. The confidence band grows quite large after only a few months. A forecast of the actual SST must include forecasts of the seasonal and residual components as well.

Time Series Network
-------------------

Network analysis is an application of graph theory. Graph theory is the study of mathematical structures used to model pairwise relations between objects. Objects and relations can be many things, with the most familiar being people and friendships. Network analysis was introduced into climatology by Tsonis and Roebber (2004). They used values of geopotential height on a spatial grid, and the relationships were based on pairwise correlation. Here you use network analysis to examine year-to-year relationships in hurricane activity. The idea is relatively new (Lacasa et al. 2008) and requires mapping a time series to a network. The presentation follows the work of Elsner et al. (2009).

### Time series visibility

How can a time series of hurricane counts be represented as a network? Consider the following plot.

```{r timeSeriesLinks}
load("annual.RData")
source("get.visibility.R")
yr = 1851:1870
net1 = get.visibility(annual$US.1[1:length(yr)])
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
plot(yr, annual$US.1[1:length(yr)], type="h", xlab="Year",
     ylab="Hurricane count", lwd=10, xaxt="n")
axis(1, at=seq(1851, 1870, 1), labels=seq(1851, 1870, 1), cex.axis=.8)
for(i in 1:nrow(net1$sm)){
  lines(c(yr[net1$sm[i, 1]], yr[net1$sm[i, 2]]),
        c(annual$US.1[net1$sm[i, 1]], annual$US.1[net1$sm[i, 2]]),
        col="lightgrey")
}
```

The time series of U.S. hurricane counts forms a discrete landscape. A bar is connected to another bar if there is a line of sight (visibility line) between them. Here visibility lines are drawn for all ten bars. It is clear that 1869, by virtue of its relatively high hurricane count (4), can see 1852, 1854, 1860, 1861, 1867, 1868, and 1870, while 1868 with its zero count can see only 1867 and 1869. Lines do not cut through bars. In this way, each year in the time series is linked in a network. The nodes are the years and the links (edges) are the visibility lines.

More formally, let $h_a$ be the hurricane count for year $t_a$ and $h_b$ the count for year $t_b$. The two years are linked if every intervening year $t_i$ with count $h_i$ satisfies
$$
h_i \leq h_b + (h_a - h_b)\frac{t_b-t_i}{t_b-t_a}
$$
By this definition each year is visible to at least its nearest neighbors (the year before and the year after), but not itself.
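The following helper (an illustrative sketch, not the author's get.visibility() implementation) applies this definition directly to a pair of years, indexed by their positions in the count vector.

```{r visibilityCheck}
# Returns TRUE if the counts at positions a and b are mutually visible,
# i.e., every intervening count lies on or below the sight line.
visible = function(h, a, b){
  if(abs(a - b) == 1) return(TRUE)          # adjacent years always see each other
  i = (min(a, b) + 1):(max(a, b) - 1)       # positions strictly between a and b
  all(h[i] <= h[b] + (h[a] - h[b]) * (b - i)/(b - a))
}
h = annual$US.1[1:20]
visible(h, 19, 2)   # does 1869 (position 19) see 1852 (position 2)?
```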
The network is invariant under rescaling of the horizontal or vertical axes of the time series as well as under horizontal and vertical translations (Lacasa et al. 2008). In network parlance, years are nodes and the visibility lines are the links (or edges). The network arises by releasing the years from chronological order and treating them as nodes linked by visibility lines.

Here we see that 1869 is well connected while 1853 is not. Years featuring many hurricanes generally result in more links, especially if neighboring years have relatively few hurricanes. This can be seen by comparing 1853 with 1858. Both years have a single hurricane, but 1858 is adjacent to years that also have a single hurricane so it is linked to four other years. In contrast, 1853 is next to two years each with three hurricanes so it has the minimum number of two links. The degree of a node is the number of links connected to it.

The function get.visibility() available in **get.visibility.R** computes the visibility lines. It takes a vector of counts as input and returns three lists: one containing the incidence matrix (sm), another a set of node edges (node), and a third with the degree distribution (pk) indicating the number of years with $k$ edges. Compute the visibility lines by typing

```{r getVisibilityCode}
vis = get.visibility(annual$US.1)
```

### Network plot

You use the network() function from the **network** package (Butts et al. 2011) to create a network object from the incidence matrix by typing

```{r networkPackageNetworkFunction, message=FALSE}
require(network)
net = network(vis$sm, directed=FALSE)
```

Then use the plot() method for network objects to graph the network.

```{r plotNetwork}
plot(net, label=1851:2010, label.cex=.6, vertex.cex=1.5,
     label.pos=5, edge.col="grey")
```

In the next plot, node color indicates the number of links (degree), going from light purple (few) to dark red (many).

```{r plotNetwork2, message=FALSE}
require(sna)
breaks = c(0, 2, 5, 10, 20, 100)
adj = as.sociomatrix(net)
deg = sna::degree(adj, gmode="graph")
catcut = as.numeric(cut(deg, breaks))
cls1 = c("#F1EEF6", "#D7B5D8", "#DF65B0", "#DD1C77", "#980043")
cls2 = c(rep("black", 4), "white")
plot(net, label=1851:2010, mode="kamadakawai", label.cex=.4,
     vertex.cex=2.5, vertex.lty=0, edge.col="lightgrey", edge.lwd=.2,
     vertex.col=cls1[catcut], label.pos=5, label.col=cls2[catcut])
```

The placement of years on the network plot is based on simulated annealing (Kamada and Kawai 1989) and the nodes are colored by the number of edges. Years with the largest number of edges are more likely to be found in dense sections of the network and are colored dark red. Years with fewer edges are found near the perimeter of the network and are colored light purple.

The **sna** package (Butts 2010) contains functions for computing properties of your network. First create a square adjacency matrix where the number of rows is the number of years and each element is a zero or one depending on whether the two years are linked, with zeros along the diagonal (a year is not linked with itself). Then compute the degree of each year, indicating the number of years it can see, and find which years can see farthest.

```{r yearVsNumberEdges}
adj = as.sociomatrix(net)
deg = sna::degree(adj, gmode="graph")
deg[1:9]
annual$Year[order(deg, decreasing=TRUE)][1:9]
```
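As a small illustration of how the adjacency matrix can be queried (not part of the original text), you can list every year visible from the most connected year.

```{r yearsSeenFromTopYear}
# Row of the adjacency matrix for the highest-degree year, used to pull out
# the years it is linked to.
top = which.max(deg)
annual$Year[top]
annual$Year[adj[top, ] == 1]
```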
The year with the highest degree is `r I(annual$Year[order(deg,decreasing=TRUE)][1])` with `r I(sort(deg,decreasing=TRUE)[1])` links. Two other years with high degree include `r I(annual$Year[order(deg,decreasing=TRUE)][2])` with `r I(sort(deg,decreasing=TRUE)[2])` links and `r I(annual$Year[order(deg,decreasing=TRUE)][3])` with `r I(sort(deg,decreasing=TRUE)[3])` links. Other relatively highly connected years are `r I(annual$Year[order(deg,decreasing=TRUE)][4])`, `r I(annual$Year[order(deg,decreasing=TRUE)][5])`, `r I(annual$Year[order(deg,decreasing=TRUE)][6])`, and `r I(annual$Year[order(deg,decreasing=TRUE)][7])`, in that order. The average degree is `r I(round(mean(deg),1))`, but the degree distribution is skewed so this number says little about a typical year.

### Degree distribution and anomalous years

The total number of links in the network (the sum of the links over all nodes) is `r I(sum(vis$pk$degree*vis$pk$k))`. There are `r I(sum(vis$pk$degree))` nodes, so 20% of the network consists of `r I(.2*sum(vis$pk$degree))` of them. If you rank the nodes by number of links, you find that the top 20% account for `r I(round(sum(sort(deg,decreasing=TRUE)[1:round(.2*sum(vis$pk$degree),0)])/sum(vis$pk$degree*vis$pk$k),1)*100)`% of the links.

You plot the degree distribution of your network by typing

```{r plotDegreeDistribution}
plot(vis$pk$k, cumsum(vis$pk$P), pch=16, log="x",
     ylab="Proportion of Years With k or Fewer Links",
     xlab="Number of Links (k)")
```

The distribution is the cumulative percentage of years with $k$ or fewer links as a function of the number of links. The horizontal axis is plotted using a log scale. Just over 80% of all years have ten or fewer links and over 50% have five or fewer. Although the degree distribution is skewed to the right, it does not have the heavy tail characteristic of a scale-free network.

You perform a Monte Carlo (MC) simulation by randomly drawing counts from a Poisson distribution with the same number of years and the same hurricane rate as the observations. A visibility network is constructed from the random counts and the degree distribution computed as before. The process is repeated 1000 times, after which the median and quantile values of the degree distributions are obtained.

```{r degreeDistribution2}
kk = numeric(); cs = numeric()
for(i in 1:1000){
  visR = get.visibility(rpois(length(annual$Year), mean(annual$US.1)))
  kk = c(kk, visR$pk$k)
  cs = c(cs, cumsum(visR$pk$P))
}
K = sort(deg, decreasing=TRUE)[1]
u = numeric(); m = numeric(); l = numeric()
for(k in 1:K){
  u[k] = quantile(cs[kk==k], prob=.975)
  m[k] = quantile(cs[kk==k], prob=.5)
  l[k] = quantile(cs[kk==k], prob=.025)
}
par(las=1, mgp=c(2, .4, 0), tcl=-.3)
plot(vis$pk$k, cumsum(vis$pk$P), pch=16, log="x",
     ylab="Proportion of years with $\\le k$ links",
     xlab="Number of links ($k$)")
abline(v=c(1, 2, 5, 10, 20), col="lightgrey", lty=2)
abline(h=seq(0, 1, .2), col="lightgrey", lty=2)
xxx = c(1:K, rev(1:K))
yyy = c(l, rev(u))
polygon(xxx, yyy, border=NA, col="lightgray")
lines(1:K, m, col="red", lwd=2)
points(vis$pk$k, cumsum(vis$pk$P), pch=16)
```

The median distribution is shown as a red line and the 95% confidence interval as a gray band. The results indicate that the degree distribution of your hurricane count data does not deviate significantly from the degree distribution of a Poisson random time series. However, it does suggest a new way to think about anomalous years. Years are anomalous not in the statistical sense of violating a Poisson assumption, but in the sense that the temporal ordering of the counts identifies a year that is unique in that it has a large count but is surrounded before and after by years with low counts.
Thus we contend that node degree is a useful indicator of an anomalous year. That is, a year that stands above most of the other years, but particularly above its 'neighboring' years, represents more of an anomaly in a physical sense than does a year that is simply well above the average. Node degree captures information about the frequency of hurricanes for a given year together with information about the relationship of that frequency to the frequencies over the given year's recent history and near future.

The relationship between node degree and the annual hurricane count is tight, but not exact. Years with a low number of hurricanes are ones that are not well connected to other years, while years with an above-normal number are ones that are more connected on average. The Spearman rank correlation between year degree and year count is `r I(round(cor(deg,annual$US.1,method="s"),2))`. But this is largely a result of low-count years. The correlation drops to `r I(round(cor(deg[annual$US.1>2],annual$US.1[annual$US.1>2],method="s"),2))` when considering only years with more than two hurricanes. Thus a high count is necessary but not sufficient for characterizing the year as anomalous, as perhaps it should be.

### Global metrics

Global metrics are used to compare networks from different data. One example is the diameter of the network, defined as the length of the longest geodesic path between any two years for which a path exists. A geodesic path (shortest path) is a path between two years such that no shorter path exists. For instance, you see that 1861 is connected to 1865 directly and also through a connection with 1862. The direct connection is a path of length one while the connection through 1862 is a path of length two.

The **igraph** package (Csardi and Nepusz 2006) contains functions for computing network analytics. To find the diameter of your visibility network, load the package, create the network (graph) from the list of edges, then use the diameter() function. Prefix the function name with the package name and two colons to avoid a conflict with the same name from another loaded package.

```{r igraphDiameter, message=FALSE}
require(igraph)
vis = get.visibility(annual$US.1)
g = graph.edgelist(vis$sm, directed=FALSE)
igraph::diameter(g)
```

The result indicates that any two years are separated by at most `r I(igraph::diameter(g))` links, although there is more than one such geodesic.

Transitivity measures the probability that the adjacent nodes of a node are themselves connected. Given that year $i$ can see years $j$ and $k$, what is the probability that year $j$ can see year $k$? In a social context it indicates the likelihood that two of your friends are themselves friends. To compute the transitivity for your visibility network type

```{r transitivity}
tran = transitivity(g)
round(tran, 3)
visG = get.visibility(annual$G.1)
gG = graph.edgelist(visG$sm, directed=FALSE)
tranG = transitivity(gG)
visF = get.visibility(annual$F.1)
gF = graph.edgelist(visF$sm, directed=FALSE)
tranF = transitivity(gF)
```

The transitivity tells you that there is a `r I(round(tran,3)*100)`% chance that two neighbors of a given node are themselves connected. The higher the probability, the greater the network density. The visibility network constructed from Gulf hurricane counts has a transitivity of `r I(round(tranG, 3))`, which compares with a transitivity of `r I(round(tranF, 3))` for the network constructed from Florida counts. The network density is inversely related to interannual variance, and this rather large difference provides some evidence to support clustering of hurricanes in the vicinity of Florida relative to the Gulf coast region. An MC simulation helps you interpret the difference against the backdrop of random variations.
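A minimal sketch of such a simulation (illustrative only, not code from the original text): build visibility networks from Poisson counts with the Gulf hurricane rate and record their transitivities, giving a reference interval against which to judge the observed values.

```{r transitivityMC}
# Reference distribution of transitivity for Poisson counts with the Gulf rate.
set.seed(1086)
tranR = replicate(200, {
  visP = get.visibility(rpois(length(annual$Year), mean(annual$G.1)))
  transitivity(graph.edgelist(visP$sm, directed=FALSE))
})
quantile(tranR, probs=c(.025, .975))
```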
An important global property is the minimum spanning tree. A tree is a connected network that contains no closed loops. By 'connected' we mean that every year in the network is reachable from every other year via some path through the network (Newman 2010). A tree is said to span if it connects all the years together. A network may have more than one spanning tree. The minimum spanning tree is the spanning tree with the smallest total edge weight; here the edges are unweighted, so every spanning tree has the same number of edges and any of them is minimal. A network may contain more than one minimum spanning tree.

You compute a minimum spanning tree by typing

```{r minimumSpanningTree}
mst = minimum.spanning.tree(g)
net = network(get.edgelist(mst))
```

The result is an object of class igraph. It is converted to a network object by specifying the edge list in the network() function. You plot the network tree by typing

```{r plotMST}
plot(net)
```

A graph with the nodes labeled and colored according to the level of 'betweenness', with arrows pointing toward later years, is obtained by typing

```{r plotMST2}
breaks = c(-1, 178, 529, 1210, 4322, 5824)
btw = betweenness(g)
catcut = as.numeric(cut(btw[-1], breaks))
cls1 = c("#FFFFCC", "#C2E699", "#78C679", "#31A354", "#006837")
cls2 = c(rep("black", 4), "white")
plot(net, label=1851:2010, mode="kamadakawai", label.cex=.4,
     vertex.cex=2.5, vertex.lty=0, edge.col="lightgrey", edge.lwd=.2,
     vertex.col=cls1[catcut], label.pos=5, label.col=cls2[catcut])
```

The betweenness of a node (or betweenness centrality) is the number of geodesics (shortest paths) that pass through it. By definition the minimum spanning tree must have a transitivity of zero. You check this by typing

```{r transitivityMST}
transitivity(mst)
```

In summary, the visibility network is the set of years as nodes together with links defined by straight lines (sight lines) on the time series graph such that a line does not intersect any year's hurricane count bar. Topological properties of the network, like betweenness, might provide new insights into the relationship between hurricanes and climate.