Chapter 4: Bayesian Statistics
==============================

"Probability does not exist."---Bruno de Finetti

```{r datePackages, message=FALSE, error=FALSE}
#setwd("~/Dropbox/Book/Chap04")
date()
#require(knitcitations)
#require(bibtex)
require(LearnBayes)
#cleanbib()
#bib = read.bibtex("../allbook.bib")
```

Learning About the Proportion of Landfalls
------------------------------------------

Here you build a Bayesian model for the proportion of all hurricanes that make landfall. The approach follows Albert (2009). Before examining the data, you hold a belief about the value of this proportion. You model your belief in terms of a prior distribution. Then, after examining some data, you update your belief about the proportion by computing the posterior distribution. This setting also allows you to predict the likely outcomes of a new sample taken from the population. It allows you to predict, for example, the proportion of landfalls for next year.

The use of the pronoun 'you' focuses attention on the Bayesian viewpoint that probabilities are simply how much *you* personally believe that something is true. Probabilities are subjective and are based on the relevant information available to you.

Here you think of a population consisting of all past and future hurricanes in the North Atlantic. Then let $\pi$ represent the proportion of this population that hit the United States at hurricane intensity. The value of $\pi$ is unknown, and you are interested in learning about what it could be. Bayesian statistics requires you to represent your belief about the uncertainty in this proportion with a probability distribution. The distribution reflects your subjective prior opinion about plausible values of $\pi$.

Before you examine a sample of hurricanes, you think about what the value of $\pi$ might be. If all hurricanes make landfall, $\pi$ is one; if none make landfall, $\pi$ is zero. This is the nature of a proportion, bounded below by zero and above by one. Also, while the number of hurricanes is an integer count, the proportion that make landfall is a real value. As a climatologist you are aware that not all hurricanes make it to land. The nature of proportions together with your background provides you with information about $\pi$ that is called 'your prior.' From this information suppose you believe that the proportion of hurricanes making landfall in the United States is equally likely to be smaller or larger than 0.3. Moreover, suppose you are 90% confident that $\pi$ is less than 0.5.

A convenient family of densities for a proportion is the beta. The beta density, here representing your prior belief about the population proportion of all hurricanes making landfall $\pi$, is proportional to
$$
g(\pi) \propto \pi^{a-1} (1-\pi)^{b-1}
$$
where the parameters $a$ and $b$ are chosen to reflect your prior beliefs about $\pi$. The mean of a beta density is $m = a/(a+b)$ and the variance is $v = m(1-m)/(a+b+1)$. Unfortunately it is difficult to assess values of $m$ and $v$ for distributions like the beta that are not symmetric. It is easier to obtain $a$ and $b$ indirectly through statements about the percentiles of the distribution. Here you have a belief about the median (0.3) and the 90th percentile (0.5). The beta.select() function in the **LearnBayes** package (Albert 2011) is useful for finding the two shape parameters of the beta density that match this prior knowledge.
The inputs are two lists, q1 and q2, that define these two percentiles, and the function returns the values of the corresponding beta parameters, $a$ and $b$.

```{r percentiles}
q1 = list(p=.5, x=.3)
q2 = list(p=.9, x=.5)
beta.select(q1, q2)
```

Note the argument p is the distribution percentile and the argument x is a value for $\pi$. Now you have your prior specified as a continuous distribution. You should plot your prior using the curve() function to see if it looks reasonable relative to your beliefs.

```{r continuousPrior, fig.width=4, fig.height=3}
par(las=1, mgp=c(2, .5, 0), tcl=-.3)
a = beta.select(q1, q2)[1]
b = beta.select(q1, q2)[2]
curve(dbeta(x, a, b), from=0, to=1,
      xlab="Proportion of landfalls", ylab="Density",
      lwd=4, las=1, col="green")
abline(v=.5, lty=2)
abline(v=.3, lty=2)
curve(dbeta(x, a, b), lwd=4, col="green", add=TRUE)
```

The distribution appears to adequately reflect your prior knowledge of landfall proportions. For reference, the vertical dashed lines are the values you specified for the median and the 90th percentile. The abline() function is used after the curve is plotted. The v argument in the function specifies where along the horizontal axis to draw the vertical line on the graph. Recall, to learn more about the function use ?abline.

This is a start, but you need to take your prior distribution and combine it with a likelihood function. The likelihood function comes from your data. It is a function that describes how likely it is, given your model, that you observed the data you actually have. It might sound a bit confusing, but consider that if you have a good model for your data, then the probability that your model will replicate your data should be relatively high.

For example, consider the number of hurricanes over the most recent set of years. Read the data files containing the annual basin-wide (A) and U.S. landfall (US) hurricane counts and compute the sums. The basin counts were obtained from http://www.aoml.noaa.gov/hrd/tcfaq/E11.html on January 17, 2012.

```{r readData}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-4/H.txt"
US = read.table(con, header=TRUE)
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-4/ATL.txt"
A = read.table(con, header=TRUE)
Yr = 2006
sum(A$H[A$Year >= Yr])
sum(US$All[US$Year >= Yr])
```

You find `r I(sum(US$All[US$Year>=Yr]))` of the `r I(sum(A$H[A$Year>=Yr]))` hurricanes made landfall in the United States. You could simply report that $\pi$ = `r I(round(sum(US$All[US$Year>=Yr])/sum(A$H[A$Year>=Yr]),2))`, or that `r I(round(sum(US$All[US$Year>=Yr])/sum(A$H[A$Year>=Yr])*100,0))`% of hurricanes make land. However, this is based on only one sample and it does not include your prior belief about $\pi$. In addition you might be interested in predicting the number of landfalls in a new sample of next year's hurricanes.

If you regard a 'success' as a hurricane making landfall and you take a random sample of $n$ hurricanes with $s$ successes and $f=n-s$ failures, then the *likelihood function* is given by
$$
L(s, f | \pi) \propto \pi^s (1 - \pi)^f,\ \hspace{1cm} 0 < \pi < 1
$$
The likelihood function is defined by the observed set of outcomes. Specifically, the likelihood of $\pi$ given that you observed `r I(sum(US$All[US$Year>=Yr]))` landfalls in a set of `r I(sum(A$H[A$Year>=Yr]))` hurricanes is equal to the probability of that particular observed set of outcomes. Here the model is the beta density with parameters $a=s+1$ and $b=f+1$. You can convince yourself that the likelihood is maximized at $\pi$ = `r I(round(sum(US$All[US$Year>=Yr])/sum(A$H[A$Year>=Yr]),2))` for $s$ = `r I(sum(US$All[US$Year>=Yr]))` and $f$ = `r I(sum(A$H[A$Year>=Yr])-sum(US$All[US$Year>=Yr]))` by plotting $L(\pi)$ for $\pi$ from 0 to 1, as in the sketch below.
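A minimal sketch of that check: it defines s and f (anticipating the identical definitions used in a later code chunk) and plots the unnormalized likelihood with a dashed line at the sample proportion $s/(s+f)$.

```{r likelihoodCheck, fig.width=4, fig.height=3}
s = sum(US$All[US$Year >= Yr])
f = sum(A$H[A$Year >= Yr]) - sum(US$All[US$Year >= Yr])
# Unnormalized likelihood pi^s * (1 - pi)^f over the unit interval.
curve(x^s * (1 - x)^f, from=0, to=1,
      xlab="Proportion of landfalls", ylab="Likelihood (unnormalized)",
      lwd=4, las=1)
# The maximum occurs at the sample proportion s/(s + f).
abline(v=s/(s + f), lty=2)
```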
Then with a likelihood function describing your data along with your prior beliefs, your posterior density for $\pi$ is obtained, up to a proportionality constant, by multiplying the prior density $g(\pi)$ by the likelihood $L(\hbox{data}|\pi)$. This is Bayes rule.
$$
g(\pi | \hbox{data}) \propto g(\pi) \times L(\hbox{data} | \pi)
$$
Your prior represents your thinking (in terms of probability) about the parameter of interest before examining the data. The likelihood function represents the probability of your data given values for the parameter. The posterior probability distribution of $\pi$ given your data is computed using Bayes rule.

It can be shown that if the prior is a beta density with parameters $a$ and $b$ and you observe $s$ successes and $f$ failures, then the posterior density is also a beta density, with parameters $a+s$ and $b+f$. This is an example of a Bayesian conjugate model, where the prior and posterior densities have the same functional form. Since the prior, likelihood, and posterior are all in the same beta family of densities, you can use dbeta() to compute the values. To compute and plot them together, type

```{r priorLikePost, fig.width=4, fig.height=3}
par(las=1, mgp=c(2, .5, 0), tcl=-.3)
s = sum(US$All[US$Year >= Yr])
f = sum(A$H[A$Year >= Yr]) - sum(US$All[US$Year >= Yr])
curve(dbeta(x, a + s, b + f), from=0, to=1,
      xlab="Proportion of landfalls", ylab="Density",
      col=1, lwd=4, las=1)
grid()
curve(dbeta(x, a + s, b + f), add=TRUE, col=1, lwd=4)
curve(dbeta(x, s + 1, f + 1), add=TRUE, col=2, lwd=4)
curve(dbeta(x, a, b), add=TRUE, col=3, lwd=4)
legend("topright", c("Prior", "Likelihood", "Posterior"),
       col=c(3, 2, 1), lwd=c(3, 3, 3), bg="white")
```

Note that the posterior density resembles the likelihood but is shifted in the direction of the prior. This is always the case. The posterior is a weighted average of the prior and the likelihood, where the weights are proportional to the precision. The greater the precision, the more weight it carries in determining the posterior. For your prior, the precision is related to how committed you are to particular values. Believing there is a 90% chance that the proportion is less than 0.5 provides a quantitative level of commitment. The less committed you are, the flatter the prior distribution, and the less weight it carries in the posterior. For your data, the precision is inversely related to the standard error, so directly related to the sample size. The more data you have, the more weight the likelihood carries in determining the posterior. Repeat using a somewhat more vague prior and a somewhat more precise prior, as in the sketch below.
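One minimal sketch of this exercise: take a flat beta(1, 1) density as the vague prior, and for the precise prior triple both of your beta parameters, which keeps the prior mean at $a/(a+b)$ while shrinking the variance. With the vague prior the posterior sits essentially on the likelihood; the precise prior pulls the posterior further toward the prior.

```{r priorSensitivity, fig.width=4, fig.height=3}
par(las=1, mgp=c(2, .5, 0), tcl=-.3)
# Posterior under a vague (flat) beta(1, 1) prior.
curve(dbeta(x, 1 + s, 1 + f), from=0, to=1,
      xlab="Proportion of landfalls", ylab="Density", lwd=4, las=1)
# Posterior under a more precise prior: same prior mean, smaller variance.
curve(dbeta(x, 3 * a + s, 3 * b + f), add=TRUE, lwd=4, col="red")
legend("topright", c("Vague prior", "Precise prior"),
       col=c(1, "red"), lwd=3, bg="white")
```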
### Inference

The posterior distribution contains all the information you need to make inferences. In this example, since the posterior is a beta distribution, the pbeta() (cumulative distribution) and qbeta() (quantile) functions can be used. For example, given your prior beliefs and your sample of data, how likely is it that the population proportion of landfalls is less than or equal to 25%? This is answered by computing the posterior probability $P(\pi \leq 0.25 | \hbox{data, prior})$, which is given by

```{r pbetaFunction}
pbeta(q=.25, a + s, b + f)
```

You interpret this probability in a natural way. You can state that given the evidence in hand (data and your prior belief), there is a `r I(round(pbeta(.25,a+s,b+f)*100,1))`% chance that no more than a quarter of all hurricanes hit the United States (at hurricane intensity). Or you can state that it is quite unlikely (`r I(round(100-pbeta(.3,a+s,b+f)*100,1))`%) that more than 3 in 10 hurricanes make U.S. landfall (computed as 1 - pbeta(.3, a + s, b + f)).

You can't do this with classical statistics. Instead, the $p$-value from a significance test summarizes the evidence against a null hypothesis. Suppose the null hypothesis is that the population proportion exceeds 3 in 10. You state $H_o: \pi > 0.3$ against the alternative $H_a: \pi \leq 0.3$ and decide between the two on the basis of a $p$-value obtained by typing

```{r proportionsTest}
prop.test(s, s + f, p=.3, alt="less")$p.value
```

where the argument p specifies the value of the null hypothesis and the argument alt specifies the direction of the alternative hypothesis. The $p$-value of `r I(round(prop.test(s,s+f,p=.3,alt="less")$p.value,3))` indicates that if the null hypothesis were true (the proportion exceeds 0.3), data like yours would be unusual. You conclude there is moderate evidence against the null. However, it is incorrect to conclude from the $p$-value that there is only a `r I(round(prop.test(s,s+f,p=.3,alt="less")$p.value*100,1))`% chance that the proportion of landfalls is greater than 0.3.

### Credible Interval

The natural interpretation of an inference extends to the interpretation of a posterior summary. For instance, the 95% interval estimate for the proportion of landfalls is obtained from the 2.5th and 97.5th percentiles of the posterior density. This is done with the qbeta() function by typing

```{r credibleInterval}
qbeta(c(.025, .975), a + s, b + f)
```

You state the 95% *credible interval* for the proportion is [`r I(round(qbeta(.025,a+s,b+ f),3))`, `r I(round(qbeta(.975,a+s,b+f),3))`] and conclude there is a 95% chance that the true proportion lies inside this interval.

Note the difference in interpretation from a confidence interval. With a credible interval you say that, given the data and your prior beliefs, there is a 95% chance the population proportion lies within this particular interval. With a confidence interval you say you are 95% confident that the method produces an interval that contains the true proportion. In the former the population parameter is the random variable; in the latter the interval is the random variable.

The 95% confidence interval estimate for the proportion $\pi$ using a large enough sample of data, where $\hat p$ is the sample proportion and $n$ is the sample size, is given by
$$
\hat p \pm \sqrt{\frac{\hat p(1-\hat p)}{n}} \times \Phi^{-1}(0.975)
$$
where $\Phi^{-1}$ is the quantile function (inverse cumulative distribution function) of the standard normal distribution, so $\Phi^{-1}(0.975)$ is its 97.5th percentile. The confidence interval is obtained by typing

```{r propsTest}
prop.test(s, s + f)$conf.int
```

The results allow you to state the 95% confidence interval for the proportion is [`r I(round(prop.test(s,s+f)$conf.int[1],digits=3))`, `r I(round(prop.test(s,s+f)$conf.int[2],digits=3))`] and conclude that if you had access to additional samples and computed the interval in this way for each sample, 95% of the intervals would cover the true proportion.
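As a check on the formula, here is a minimal sketch computing the large-sample (Wald) interval directly. Note that prop.test() uses the Wilson score method with a continuity correction, so its interval will differ slightly from this one.

```{r waldInterval}
n = s + f
phat = s/n
# Wald interval: phat plus/minus 1.96 standard errors.
phat + c(-1, 1) * qnorm(.975) * sqrt(phat * (1 - phat)/n)
```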
In some cases the intervals produced (confidence and credible) from the same data set are nearly identical. They are almost certainly different if an informative prior is included, and they may be different even if the prior is relatively uninformative (vague). However, the interpretations are always different.

With Bayesian statistics inferential statements are easy to communicate. It's natural to talk about the probability that $\pi$ falls within an interval or the probability that a hypothesis is true. Also, Bayes rule is the only thing you need to remember. It's used for small and large samples. It is the only consistent way to update your probabilities. The cost is that you need to specify your prior beliefs.

### Predictive Density

So far you have focused on learning about the population proportion of hurricanes that make landfall. Suppose your interest is in predicting the number of U.S. landfalls ($\tilde l$) in a future sample of seven hurricanes. If your current understanding of $\pi$ is contained in the density $g(\pi)$ (e.g., the posterior density), then the predictive density for $\tilde l$ is given by
$$
f(\tilde l) = \int f(\tilde l |\pi) g(\pi) d\pi
$$
The beta predictive probabilities are computed using the function pbetap() from the **LearnBayes** package. The inputs are a vector ab containing the beta parameters, the size of the future sample m, and a vector of the number of future landfalls lf.

```{r posteriorLandfalls}
ab = c(a + s, b + f)
m = 7; lf = 0:m
plf = pbetap(ab, m, lf)
round(cbind(lf, plf), digits=3)
```

The predictive density can also be obtained through simulation. For example, to obtain $\tilde l$, first simulate $p$ from $g(p)$, and then simulate $\tilde l$ from the binomial distribution, $f_B(l | n, p)$. You can use this approach on your beta posterior. First simulate 1000 draws from the posterior and store them in p.

```{r simulatePredictiveprobabilities}
p = rbeta(n=1000, a + s, b + f)
```

Then simulate values of $\tilde l$ for these random values using the rbinom() function and tabulate them.

```{r simulateLandfallcounts}
lc = rbinom(n=1000, size=7, prob=p)
table(lc)
```

The table indicates that, of the 1000 simulations, `r I(table(lc)[1])` of them resulted in no landfalls, `r I(table(lc)[2])` of them resulted in one landfall, and so on. You save the frequencies in a vector and then convert them to probabilities by dividing by the total number of simulations. Finally, plot the probabilities using the argument type="h" (histogram-like vertical lines).

```{r Predictedlandfallprobs, fig.width=4, fig.height=3}
par(las=1, mgp=c(2, .5, 0), tcl=-.3)
freq = table(lc)
pp = freq/sum(freq)
plot(pp, type="h", xlab="Number of hurricanes", las=1, lwd=8,
     ylim=c(0, .4), ylab="Predictive probability")
```

The most likely count is one. The probability of four or more hitting the United States is `r I((1-sum(pp[1:4]))*100)`%. The cumulative sum of the probabilities is found by typing

```{r cumsumProb}
cumsum(pp)
```

The probability of three or fewer landfalls is `r I(cumsum(pp)[4])`. Suppose you wish to summarize this discrete predictive distribution by an interval that covers a certain amount of the probability. This can be done using the function discint(), again from the **LearnBayes** package.
You combine the vector of probabilities with a vector of counts and specify a coverage probability as

```{r probabilityInterval}
mc = length(pp) - 1
int = discint(cbind(0:mc, pp), .95)
int
```

The output is a list with two components: the achieved probability (prob), the smallest total probability greater than or equal to the specified coverage (here 95%), and the set of counts (set) over which the individual probabilities are summed. You can see the probability is `r I(discint(cbind(0:mc,pp),.95)$prob*100)`% that the number of U.S. landfalls is one of the counts in this set. The interval covers a range of `r I(length(int$set))` counts, which is necessarily wider than the range of counts computed by using the bounds on the `r I(int$prob*100)`% credible interval around the population proportion. This is because in predicting a specific value, as with the linear regression model, there are two sources of uncertainty: the uncertainty about the population proportion $\pi$ and the uncertainty about the count given an estimate of $\pi$.

### Is Bayes Rule Needed?

Since all probabilities are ultimately subjective, why is Bayes rule needed? Why don't you simply examine your data and subjectively determine your posterior probability distribution, updating what you already know? Wouldn't this save you trouble? No. It will likely cause more trouble. The catch is that it is hard to assess probabilities rationally. Probabilities must be non-negative and sum to one. If these conditions are not met, then your probabilities are said to be inconsistent. In fact they should obey certain axioms of coherent assessment. If you consider event A to be more likely than event B, and event B to be more likely than event C, then you should consider event A to be more likely than event C. This property is called transitivity. If your probabilities fail to obey transitivity, then your assessment will be inconsistent in a decision-making sense.

In practice this is not a serious impediment since you can easily remove your inconsistency once you are made aware of it. If someone notes an arithmetic error you would certainly correct it. Similarly, if someone notes that your prior probabilities are not transitive, you would change them accordingly. In this regard, it is sometimes useful to attempt to assess the same set of probabilities in more than one way in order to `check' your assessments. You might assess probabilities using the mean and variance and then compare these probabilities to an assessment based on quantiles.

Importantly, if you obey the axioms of coherence, then to revise your probabilities you must use Bayes rule. To do otherwise would be inconsistent. Numerous psychological studies have indicated that people do not always intuitively revise probabilities on the basis of this rule. People tend to give too little `weight' to the data. In general, this means that they do not change their probabilities as much as Bayes theorem tells them that they should. By using the rule you safeguard against inconsistent assessment of the available information.

Bayesian Computation
--------------------

For simple problems, the likelihood functions are standard, and if you use a conjugate model, where the prior and posterior densities have the same functional form, deriving the posterior density poses no great computational burden. Indeed this is why conjugate models like the beta are widely used in Bayesian analysis. You showed above that the probability of a random hurricane hitting the United States can be modeled using a beta prior. In this case, computation is nothing but addition: the posterior parameters are simply $a+s$ and $b+f$.

But Bayesian computation quickly becomes more challenging when you work with more complicated or non-conjugate models, where characterizing the posterior density is non-trivial. This is often the case in practice. Several options exist. One is brute force: when the posterior distribution is not a familiar function, you can compute its values over a grid and approximate the continuous posterior by a discrete one. This is practical for models with one or two parameters, as in the sketch below.
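As an illustration, here is a minimal sketch of the grid approach applied to the landfall proportion from earlier in this chapter, where the exact beta posterior is available for comparison.

```{r gridApproximation}
# Evaluate prior times likelihood on a grid of pi values, then normalize
# so the discrete probabilities sum to one.
pi.grid = seq(.001, .999, length=500)
post = dbeta(pi.grid, a, b) * pi.grid^s * (1 - pi.grid)^f
post = post/sum(post)
sum(pi.grid * post)      # posterior mean from the grid approximation
(a + s)/(a + s + b + f)  # exact posterior mean from the conjugate model
```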
In situations where you can directly sample from the posterior, a Monte Carlo (MC) algorithm gives an estimate of the posterior mean for any function of your parameters. In situations where you can't directly sample, you can use rejection sampling provided you have a suitable proposal density or, in the most general case, you can adopt a Markov chain Monte Carlo (MCMC) approach.

### Time-to-Acceptance

Publication is central to science. Authors, wanting to be first in an area, are keenly interested in the speed of the manuscript review process. Editors, focused on ensuring a thorough review, are well aware that timeliness is important to a journal's reputation. It can also be a practical matter. For example, there is usually a cutoff date after which a manuscript will not be considered by an assessment report.

Here you assume information about manuscript review time might be useful to authors and editors, especially if it can say something about future submissions. This motivates you to collect data on publication times from recent journals and to model them using a Bayesian approach. A Bayesian model has the advantage that its output is in the form of a predictive distribution. Here you can't exploit conjugacy, so you approximate the posterior with a discrete distribution. You sample directly from the distribution and use an MC algorithm for summarizing your samples.

We use the American Meteorological Society (AMS) journals and the keyword `hurricane' appearing in published titles over the years 2008--2010. The search is done from the website http://journals.ametsoc.org/action/doSearch. Selecting `Full Text' on a particular article brings up the abstract, keywords, and two dates: received and accepted---in month, day, year format. The data are entered manually but are available to you by typing

```{r readHURARTdata}
con = "http://hurricaneclimatology.squarespace.com/storage/chapter-4/hurart.txt"
art = read.csv(con, header=TRUE)
```

For each article both dates are provided along with the lead author's last name and the name of the journal. There are `r I(dim(art)[1])` articles with the word `hurricane' in the title over the three-year period. Of these, `r I(sum(art$FinYr==2008))`, `r I(sum(art$FinYr==2009))`, and `r I(sum(art$FinYr==2010))` had `accepted' years of 2008, 2009, and 2010, respectively. Articles appearing in 2008 but with accepted years before 2008 are not included. Ten different journals are represented.

```{r stackedHistogram}
art$TTA = difftime(ISOdate(art$FinYr, art$FinMo, art$FinDay),
                   ISOdate(art$RecYr, art$RecMo, art$RecDay),
                   units="days")
art$ftta = cut(as.numeric(art$TTA), breaks=seq(0, 720, 30))
require(ggplot2)
ggplot(art, aes(ftta, fill=Jo)) +
  geom_bar() + theme_bw() +
  theme(axis.text.x=element_text(angle=45, hjust=.5, vjust=.7),
        legend.position=c(.7, .6)) +
  labs(x="Time to acceptance (days)", y="Count")
```

First you compute the time period between the received and accepted dates.
This is done by converting the numeric values of year, month, and day to an actual date using the ISOdate() function and then using the difftime() function to compute the time difference in days.

```{r computeTau}
rec = ISOdate(art$RecYr, art$RecMo, art$RecDay)
acc = ISOdate(art$FinYr, art$FinMo, art$FinDay)
tau = difftime(acc, rec, units="days")
```

You call the temporal difference the time-to-acceptance ($\tau$). This is the statistic of interest. The mean $\tau$ is `r I(round(as.integer(mean(tau[art$FinYr==2008])),0))` days in 2008 (type mean(tau[art$FinYr==2008])), `r I(round(as.integer(mean(tau[art$FinYr==2009])),0))` days in 2009, and `r I(round(as.integer(mean(tau[art$FinYr==2010])),0))` days in 2010. The mean $\tau$ for each of the four journals with the most articles is `r I(round(as.integer(mean(tau[art$Jo=="MWR"])),0))` days (*Monthly Weather Review*), `r I(round(as.integer(mean(tau[art$Jo=="JAS"])),0))` days (*Journal of the Atmospheric Sciences*), `r I(round(as.integer(mean(tau[art$Jo=="JC"])),0))` days (*Journal of Climate*), and `r I(round(as.integer(mean(tau[art$Jo=="WF"])),0))` days (*Weather and Forecasting*). On average the *Journal of Climate* is slowest and *Monthly Weather Review* is fastest, but the difference is less than 3.5 weeks.

Many factors influence the value of $\tau$. On the editor's side there are the pre-review, the review requests, and the dispensation decision, among others. On the reviewer's side there is workload as well as breaks for travel and vacation, and on the author's side there is the effort needed to revise and resubmit.

The goal is a predictive distribution for $\tau$. This will allow you to make inferences about the time-to-acceptance for future manuscript submissions. You assume $\tau$ is a random variable having a gamma density given by
$$
f(\tau | \alpha, \beta) = \frac{\tau^{\alpha-1} \exp(-\tau/\beta)}{\beta^\alpha \Gamma(\alpha)}
$$
where $\alpha$ and $\beta$ are the shape and scale parameters, respectively, and $\Gamma(\cdot)$ is the gamma function ($\Gamma(\alpha) = (\alpha-1)!$ when $\alpha$ is a positive integer). The gamma density is commonly used to model time periods (wait times, phone call lengths, etc.). If you place a uniform prior distribution on the parameter vector $(\alpha, \beta)$, the posterior density is given by
$$
g(\alpha, \beta | \tau) \propto g(\alpha, \beta) f(\tau|\alpha, \beta)
$$
The uniform prior is consistent with a judgment that the parameters are the same regardless of author or journal. Random draws from this joint posterior density are summarized and also used to draw predictive samples for $\tau$ from a gamma density. To make the computation easier, the posterior density is reformulated in terms of $\log \alpha$ and $\log \mu$, where $\mu = \alpha \beta$ is the mean of the gamma density. A function evaluating the log posterior density is in the gsp.R file (from Jim Albert), available from http://www.hurricaneclimate.com/chapter-4/gsp.R. Download it and bring it into R by typing

```{r sourceGammaPosterior}
source("gsp.R")
gsp <- function(theta, data){
  a = exp(theta[1])
  mu = exp(theta[2])
  n = length(data)
  val = 0 * a
  for (i in 1:n)
    val = val + dgamma(data[i], shape=a, scale=mu/a, log=TRUE)
  return(val + log(mu))  # sum of log gamma densities plus the log(mu) term
}
```

Include quotes around the file name.
Given a pair of parameter values ($\log \alpha$, $\log \mu$) and the data, the function computes the log posterior density using
$$
\log g(\alpha, \frac{\mu}{\alpha} | \tau) = \log \mu + \sum_{i=1}^n \log f(\tau_i|\alpha, \frac{\mu}{\alpha})
$$
It performs this computation for pairs of parameters defined over a two-dimensional grid spanning the domain of the posterior. Alternatively, you could use an MCMC approach to compute the posterior density. This would allow you to use non-uniform priors for the parameters.

The gsp() function is used to determine the posterior density over a two-dimensional grid of parameter values (on log scales) using the mycontour() function (**LearnBayes**). It is also used to sample from the posterior using the simcontour() function. Using these functions you contour the joint posterior of the two parameters and then add 1000 random draws from the distribution by typing

```{r parameterPosterior}
lim = c(.8, 2.1, 5, 5.45)
mycontour(gsp, limits=lim, data=tau, xlab="log alpha", ylab="log mu")
s = simcontour(gsp, lim, data=tau, m=1000)
points(s$x, s$y, pch=19, col="gray", cex=.5)
```

These computations take a few seconds to complete. The contours from the inside out are drawn at 10%, 1%, and 0.1% of the maximum posterior density.

```{r parameterposteriorFig}
par(mar=c(5, 4, 4, 2) + .1, mgp=c(2, .5, 0), tcl=-.3)
plot(lim[1:2], lim[3:4], type="n", las=1,
     xlab="log($\\alpha$)", ylab="log($\\mu$)")
grid()
points(s$x, s$y, pch=19, col="gray", cex=.5)
mycontour(gsp, limits=lim, data=tau, add=TRUE)
```

The graph shows the mean time to acceptance is about 181 days [exp(5.2) on the ordinate]. This might be useful to an editor. But the posterior gives additional information. For example, the editor can use your model to estimate the probability that the average time to acceptance will exceed 200 days for the set of manuscripts arriving next month. Since the model provides draws, the question is answered by finding the percentage of draws exceeding the logarithm of 200. Assuming the set of manuscripts is a random sample, the model predicts `r I(round(sum(s$y>log(200))/1000*100,1))`%.

These averages are not particularly useful to an author. An author would like to know if a recently submitted manuscript will be accepted in less than 120 days. To answer this question, the author takes random draws of $\tau$ from a gamma density using the random draws from the posterior parameter distribution. This is done with the rgamma() function; the answer is the percentage of these draws less than 120 days.

```{r posteriorTau}
alpha = exp(s$x)
mu = exp(s$y)
beta = mu/alpha
taum = rgamma(n=1000, shape=alpha, scale=beta)
```

Here the model predicts a probability of `r I(round(sum(taum<120)/1000*100,1))`%. Note that this probability includes the additional uncertainty associated with predicting an individual value rather than estimating a parameter.

Changes to review rules and manuscript timetables and tracking will influence time to acceptance. To the extent that these changes occur during the period of data collection or subsequently, they will influence your model's ability to accurately anticipate time to acceptance. Model fit is checked by examining quantile statistics from the data against the same statistics from the posterior draws, as in the sketch below.
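A minimal sketch of this check, using the observed tau values and the posterior predictive draws taum generated above:

```{r modelFitCheck}
# Compare empirical tail proportions with the same proportions from the
# posterior predictive draws.
sum(tau < 90)/length(tau)     # data: proportion accepted in under 90 days
sum(taum < 90)/length(taum)   # predictive draws under 90 days
sum(tau > 360)/length(tau)    # data: proportion taking over 360 days
sum(taum > 360)/length(taum)  # predictive draws over 360 days
```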
For instance, the percentage of articles in the data with $\tau$ less than 90 days is `r I(round(sum(tau<90)/length(tau)*100,1))`, which compares with a percentage of `r I(round(sum(taum<90)/length(taum)*100,1))` from the posterior draws. Continuing, the percentage of articles with $\tau$ longer than 360 days from the data is `r I(round(sum(tau>360)/length(tau)*100,1))`, which compares with a percentage of `r I(round(sum(taum>360)/length(taum)*100,1))` from the posterior draws.

The model has practical applications for authors wishing to meet deadlines. As an example, in order for research to be considered by the 5th Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), authors must have the manuscript accepted for publication by March 15, 2013.

```{r probvsdatefig}
s = simcontour(gsp, c(.8, 2.1, 5, 5.45), tau, 1000)
pu = numeric()
pl = numeric()
pm = numeric()
for(t in 1:361){
  p = numeric()
  for(j in 1:1000){
    alpha = exp(s$x)
    mu = exp(s$y)
    beta = mu/alpha
    taum = rgamma(n=1000, shape=alpha, scale=beta)
    p[j] = sum(taum < t)/1000
  }
  pu[t] = quantile(p, prob=.975)
  pl[t] = quantile(p, prob=.025)
  pm[t] = quantile(p, prob=.5)
}
x = as.Date("March 16, 2013", format='%B %d, %Y')
t = seq(361, 1, -1)
days = x - t
pur = rev(pu); plr = rev(pl); pmr = rev(pm)
par(mar=c(6, 6, 3, 2) + .1, las=2, mgp=c(2.4, 1, 0), tcl=-.3)
plot(days, pmr, pch=19, type="l", lwd=2, cex=.6, ylim=c(0, 1),
     ylab="Probability of acceptance\n before March 15, 2013",
     xlab="", xaxt="n")
xx = as.numeric(days[seq(1, 361, 30)])
yy = par("usr")[3] - .05
axis(1, at=xx, labels=rep("", length(xx)))
text(xx, yy, labels=as.character(days[seq(1, 361, 30)]), srt=45,
     adj=c(1, 1), xpd=TRUE, cex=.8)
mtext("Submit date [year-month-day]", 1, line=4, las=1)
abline(h=seq(0, 1, .2), col="lightgray", lty="dotted")
abline(v=xx, col="lightgray", lty="dotted")
xxx = c(days, rev(days))
yyy = c(plr, rev(pur))
polygon(xxx, yyy, border=NA, col="gray")
lines(days, pmr, lwd=2)
```

The graph shows your model's predictive probability of meeting this deadline as a function of submission date. Results are based on 1000 random draws from a gamma distribution whose shape and scale parameters are derived from 1000 draws from the joint posterior given the data. The probability is the percentage of predictive $\tau$ draws less than the number of days remaining until March 15, 2013. The 95% credible interval, shown by the gray band, is obtained by repeating the 1000 draws from the gamma distribution 1000 times and taking the 0.025 and 0.975 quantile values of the probabilities.

The probability is high in the early months of 2012 when the deadline is still a year in the future. However, by mid-September of 2012 the probability drops below 50%, and by mid-January 2013 it is less than 10%.

The predictive probabilities from your model reflect what can be expected when an arbitrary author submits a manuscript with `hurricane' in the title to an arbitrary AMS journal, under the assumption that the paper will eventually be accepted. Certain journals and authors could have a faster or slower turn-around time. For instance, we note that the two papers in the data with Kossin as lead author have $\tau$'s of 84 and 56 days, values substantially smaller than the mean. To accommodate author and journal differences you need a hierarchical model. With a hierarchical model you use a Markov chain Monte Carlo approach or approximation methods to obtain the posterior probabilities.
Markov Chain Monte Carlo
------------------------

The MCMC approach generates samples from your posterior distribution. It is flexible and is one reason behind the recent popularity of Bayesian approaches. Suppose your parameter vector of interest is $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$. The joint posterior distribution of $\theta$, denoted [$\theta$ | data], may be of high dimension and difficult to summarize. Instead suppose you define the set of conditional distributions as
$$
\left[\theta_1 | \theta_2, \ldots, \theta_p, \ \hbox{data}\right],
\left[\theta_2 | \theta_1, \theta_3, \ldots, \theta_p, \ \hbox{data}\right],
\cdots,
\left[\theta_p | \theta_1, \ldots, \theta_{p-1}, \ \hbox{data}\right],
$$
where [$X$ | $Y, Z$] represents the distribution of $X$ conditional on values of the random variables $Y$ and $Z$. The idea is that you can set up a Markov chain whose target is the joint posterior distribution by simulating the individual parameters from this set of $p$ conditional distributions. Drawing one value for each individual parameter from these distributions in turn is called one update (iteration) of a Gibbs sampler. Under general conditions, the draws converge to the target distribution (the joint posterior of $\theta$). Unfortunately this theoretical result provides no practical guidance on how to decide if the simulated sample provides a reasonable approximation to the posterior density (Jackman 2009). This must be done on a model-by-model basis. The sketch below illustrates the mechanics on a toy problem.
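To make the idea concrete, here is a minimal toy sketch (unrelated to the hurricane data): a Gibbs sampler for a standard bivariate normal distribution with correlation $\rho$, where each full conditional is a known normal density. Each iteration updates $\theta_1$ given the current $\theta_2$ and then $\theta_2$ given the new $\theta_1$; the collection of draws approximates the joint distribution.

```{r toyGibbs}
rho = .8
m = 5000
theta = matrix(0, nrow=m, ncol=2)
for(i in 2:m){
  # Full conditional: [theta1 | theta2] ~ N(rho * theta2, 1 - rho^2).
  theta[i, 1] = rnorm(1, mean=rho * theta[i - 1, 2], sd=sqrt(1 - rho^2))
  # Full conditional: [theta2 | theta1] ~ N(rho * theta1, 1 - rho^2).
  theta[i, 2] = rnorm(1, mean=rho * theta[i, 1], sd=sqrt(1 - rho^2))
}
cor(theta[, 1], theta[, 2])  # approaches rho as the chain converges
```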
### JAGS

A popular general-purpose MCMC software that implements Gibbs sampling is WinBUGS (the Windows version of Bayesian inference Using Gibbs Sampling) (Lunn et al 2000). It is stand-alone with a Graphical User Interface (GUI). JAGS (Just Another Gibbs Sampler) is an open-source project written in C++. It runs on any computing platform and can be accessed from R with functions from the **rjags** package (Plummer 2011). Download JAGS from http://mcmc-jags.sourceforge.net/.

Here you use JAGS on the earlier problem of making inferences about the proportion of U.S. landfalls. The example is simple and an MCMC algorithm is not necessary, but it's instructive for you to see the work flow. To begin, download and install the latest version of JAGS. This is C++ code that gets installed outside of R. Next open a text file and write the model code in a language that JAGS understands. You can copy and paste from here. Call the file **JAGSmodel.txt** (http://www.hurricaneclimate.com/storage/chapter-4/JAGSmodel.txt).

___JAGS code___

model {
h ~ dbin(pi, n) #likelihood
pi ~ dbeta(a, b) #prior
a <- 3.26
b <- 7.19
}

Note the similarity of the syntax to the syntax of R. Every JAGS (and BUGS) program begins with a model statement followed by an opening curly brace and ends with a closing curly brace. The second line specifies the landfall count h using the tilde character, which denotes a stochastic relation. A stochastic relation has the form node ~ ddens(arguments), where node is your parameter and ddens is the name of a function calling one of the distributions supported in JAGS. These functions all start with the letter d, as here with dbin for the binomial mass function and dbeta for the beta density used in the third line for your prior. Here the values of the parameters a and b are the same as you used earlier in the chapter and are set deterministically using the <- assignment operator.

JAGS and BUGS are declarative languages; R is a procedural language. The JAGS syntax specifies the model, but it does not define the computational steps. When the JAGS model is compiled, the model syntax is turned into a set of sequential instructions, but this is hidden from you. Because of this, the statement order is not important. What matters is that the compiler can resolve the names and links in the model. Think of the JAGS model as a directed acyclic graph (DAG).

```{r dag}
require(grid)
hc = grid.circle(r=.05, x=.6, y=.2)
grid.text("h", x=.6, y=.2)
pic = grid.circle(r=.05, x=.4, y=.5)
grid.text(expression(pi), x=.4, y=.5)
nc = grid.circle(r=.05, x=.8, y=.5)
grid.text("n", x=.8, y=.5)
grid.curve(.43, .45, .55, .25,
           arrow=arrow(type="closed", length=unit(2, "mm")),
           angle=10, curvature=0, gp=gpar(fill="black"))
grid.curve(.77, .45, .65, .25,
           arrow=arrow(type="closed", length=unit(2, "mm")),
           angle=10, curvature=0, gp=gpar(fill="black"))
ac = grid.circle(r=.05, x=.2, y=.8)
grid.text("a", x=.2, y=.8)
bc = grid.circle(r=.05, x=.6, y=.8)
grid.text("b", x=.6, y=.8)
grid.curve(.23, .75, .35, .55,
           arrow=arrow(type="closed", length=unit(2, "mm")),
           angle=10, curvature=0, gp=gpar(fill="black"))
grid.curve(.57, .75, .45, .55,
           arrow=arrow(type="closed", length=unit(2, "mm")),
           angle=10, curvature=0, gp=gpar(fill="black"))
```

Nodes are the parameters and data, and arrows indicate conditional dependency, either stochastic or deterministic. The landfall proportion $\pi$ is conditionally dependent on your prior assumptions through the beta density with shape parameters $a$ and $b$. The number of landfalls $h$ depends on the basin-wide count $n$ and the proportion of landfalls $\pi$ through the binomial distribution. The conditional independence structure of your model is easier to see when represented as a DAG. Note that BUGS allows you to generate model code from a DAG.

In R type

```{r requireJAGS, message=FALSE}
require(rjags)
```

You call JAGS from R with the jags.model() function,

```{r callJAGS}
model = jags.model('http://www.hurricaneclimate.com/storage/chapter-4/JAGSmodel.txt',
                   data = list('h'=4, 'n'=34),
                   inits = list('pi'=.5, .RNG.seed=3042,
                                .RNG.name="base::Super-Duper"))
```

The function sends the model in the file *JAGSmodel.txt* to JAGS for parsing and compiling. The data argument specifies the number of landfalling hurricanes $h$ and the number of hurricanes $n$ as a list. The inits argument tells JAGS how to initialize the MCMC, although that is not necessary with this simple model. The last two elements in the inits list specify the random number generator (RNG) and an initial seed value for the generator. This allows the results to be replicated, as the sequence of random numbers will be identical.

The object model is of class jags. By default the function runs 1000 updates (iterations) of the MCMC algorithm. This is enough for successive values of $\pi$ to depart from the initial value and reach the posterior density. In fact, this 'burn-in' is instantaneous with this simple model. If more iterations are needed you specify the number in the function with the argument n.adapt, or you can use the update() method on the model object by typing

```{r moreIterations, eval=FALSE}
update(model, 1000)
```

An additional 1000 iterations of the MCMC are generated but not saved. Finally, to generate posterior samples of $\pi$ and save them, you use the coda.samples() function.
It continues updating the chain for the number of iterations specified by n.iter, but this time it saves the values of any parameter listed in the variable.names argument.

```{r generateSamples}
out = coda.samples(model, variable.names='pi', n.iter=1000)
```

CODA stands for Convergence Diagnosis and Output Analysis. It describes a set of functions for analyzing outputs generated from MCMC. The object returned is of class mcmc.list. You use the summary() method on this object to obtain a summary of the posterior distribution for $\pi$.

```{r summaryoutput}
summary(out)
```

The output begins with the characteristics of your MCMC and then provides statistics on the samples from the posterior distribution. The posterior mean, indicating the proportion of all North Atlantic hurricanes that make landfall in the United States, is `r I(round(summary(out)$statistics[1],2))`. Quantiles of $\pi$ from the posterior samples indicate the 95% credible interval for the proportion is (`r I(round(summary(out)$quantiles[1],2))`, `r I(round(summary(out)$quantiles[5],2))`).

You obtain other statistics from the posterior samples by extracting the array corresponding to $\pi$ from the MCMC list and performing the appropriate computations. For instance, given your prior and your data, how likely is it that the population proportion of landfalls is less than or equal to 0.25? This is answered by typing

```{r populationPercentage}
pi = as.array(out)
sum(pi <= .25)/length(pi)
```

Here you extract the samples from the MCMC list using as.array(). The answer compares well with what you obtained earlier from the conjugate model. Additional analysis can be done with the posterior samples, and the plot method produces a trace plot and a density plot of your $\pi$ values. Type

```{r plotOutput}
plot(out)
```

The trace plot shows your MCMC samples versus the simulation index. It is useful for assessing chain convergence. Although the samples vary, the mean and variance are relatively constant, indicating convergence. This is expected with a simple model. The density plot shows the distribution of all the samples.

The MCMC approach is flexible and allows you to easily answer related inferential questions. For instance, suppose you are interested in the probability that any two consecutive hurricanes that form in the North Atlantic will both hit the United States. You add a node of the form pi2 <- pow(pi, 2) to your model by editing the file *JAGSmodel.txt* (a sketch of the modified file appears at the end of this section). You then monitor this node and draw inferences from the posterior samples.

```{r nextTwoHurricanes}
model = jags.model('http://www.hurricaneclimate.com/storage/chapter-4/JAGSmodel.txt',
                   data=list('h'=4, 'n'=34),
                   inits=list('pi'=.5))
update(model, 1000)
out = coda.samples(model, variable.names='pi2', n.iter=1000)
summary(out)
```

The median posterior probability indicates only a `r I(round(summary(out)$quantiles[3],3)*100)`% chance of two consecutive hurricanes both hitting the United States. The result is based on the assumption that consecutive hurricanes are independent.
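For reference, a sketch of what the modified **JAGSmodel.txt** might look like with the pi2 node added (the beta parameters are unchanged; the comment on the new line is an assumption about its interpretation):

___JAGS code___

model {
h ~ dbin(pi, n) #likelihood
pi ~ dbeta(a, b) #prior
pi2 <- pow(pi, 2) #probability both of two consecutive hurricanes hit
a <- 3.26
b <- 7.19
}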