Transcript Document
1  Review of Bayesian Analysis and Modeling
James B. Elsner, Department of Geography, Florida State University

2  Bayesian analysis and modeling using MCMC is good for you because...
- It has a solid decision-theoretic framework. Output from a probability model can be linked naturally to a utility model.
- It is intuitive. It combines the prior distribution (prior beliefs and/or experience) with the likelihood (experiment) to obtain the posterior distribution (accumulated information). In this context, knowledge is equivalent to a distribution, and knowledge precision can be quantified by the precision parameter.
- It is exact. Asymptotic approximations are not used, and the "plug-in" principle is avoided.
- Computations are tractable for practically all models, so the focus shifts from model estimation to model correctness.

3  Anticipating Florida Hurricanes
James B. Elsner and Thomas H. Jagger, Florida State University

This past season of unprecedented Florida hurricane activity was not entirely a consequence of bad luck. A notable preseason weakening of the sea-level pressure gradient over the eastern North Atlantic foreshadowed the onslaught of storms, consistent with our research showing a negative relationship between springtime values of the North Atlantic Oscillation (NAO) and hurricane activity over the U.S. coast (Elsner et al. 2001; Jagger et al. 2001), especially along the southeast coast that includes Florida (Elsner 2003). Here we design a hierarchical Bayesian model for annual Florida hurricane counts that includes a preseason (May-June) value of the NAO as a covariate and that assigns greater uncertainty to the counts prior to 1900. Using an average index value for May-June of 2004, the model hindcasts an active season for Florida. The model's 45% chance of at least one Florida hurricane for 2004 stands in bold contrast to the most recent 53-year historical probability of 30%.
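The prior-times-likelihood-gives-posterior idea above, and the notion that knowledge precision grows as information accumulates, can be made concrete with a minimal conjugate sketch. This Beta-Binomial example is an illustrative assumption on my part, not the hurricane model from the slides:

```python
# Minimal sketch (not from the slides): conjugate Beta-Binomial updating.
# A Beta(a, b) prior combined with k successes in n binomial trials gives
# a Beta(a + k, b + n - k) posterior.

def beta_binomial_update(a, b, k, n):
    """Return posterior Beta parameters after observing k successes in n trials."""
    return a + k, b + (n - k)

# Start from a weakly informative prior Beta(2, 2), then observe data.
a, b = beta_binomial_update(2.0, 2.0, k=7, n=10)   # -> Beta(9, 5)

post_mean = a / (a + b)                                 # posterior mean
post_var = (a * b) / ((a + b) ** 2 * (a + b + 1))       # posterior variance
precision = 1.0 / post_var                              # precision grows with data

print(a, b)                 # 9.0 5.0
print(round(post_mean, 3))  # 0.643
```

Repeating the update with more data shrinks the posterior variance, which is exactly the sense in which "knowledge precision" on the slide is quantified by a precision parameter.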
Introduction - Doodle - Code - Data - Initial Conditions - Results
http://garnet.fsu.edu/~jelsner/www/research.html
Anticipating Florida Hurricanes, November 2004.

4  Frequentist vs. Bayesian

Definition of probability
- Frequentist: Long-run expected frequency in repeated (actual or hypothetical) experiments.
- Bayesian: Relative degree of belief in the state of the world.

Point estimate
- Frequentist: Maximum likelihood estimate.
- Bayesian: Mean, mode, or median of the posterior probability distribution.

Confidence intervals of parameters
- Frequentist: Based on the likelihood ratio test, which is based on the expected distribution of the maximum likelihood over many experiments.
- Bayesian: Credible intervals based on the posterior probability distribution.

Confidence intervals of nonparameters
- Frequentist: Based on the likelihood profile, or by resampling from the sampling distribution of the parameter.
- Bayesian: Calculated directly from the distribution of parameters.

Model selection
- Frequentist: Discard terms that are not significantly different from a null model (nested) at a previously set confidence level.
- Bayesian: Retain terms in models on the argument that processes are not absent simply because they are not statistically significant.

Difficulties
- Frequentist: Confidence intervals are confusing (a range that will contain the true value in a proportion α of repeated experiments); rejection of model terms for non-significance.
- Bayesian: Subjectivity; need to specify priors.

5  Why use Bayesian analysis and models?
- When you have good (informative) prior information that you want to use (from previous experiments, or reliable information from the literature). Frequentist meta-analysis can also be used in this case, but the Bayesian approach is natural.
- When you want to specify the distribution of nonparameters (future values), Bayesian models work well.
In particular, if you are going to make a practical decision based on your model (using decision theory), for example picking a management regime that minimizes expected risks or maximizes expected returns, having the posterior distribution of all the parameters makes the decision calculations easier.
- Complicated models, such as hierarchical models or models with missing or unobserved data, can be particularly convenient to fit using Bayesian methods. Frequentists call these random-effects or mixed models and have their own estimation procedures.

6  How Gibbs Sampling Works
The contours in the plot represent the joint distribution of θ, and the labels (0), (1), etc. denote the simulated values. Note that one iteration of the algorithm is complete after both components are revised. Also notice that each component is revised along the direction of the coordinate axes. This feature can be a source of problems if the two components are correlated, because then the contours get compressed and movements along the coordinate axes tend to produce small moves.

[Figure: Gibbs sampling algorithm in two dimensions (axes θ1 and θ2), starting from an initial point (0) and completing three iterations, (1) through (3).]

7  After the model has converged, samples from the conditional distributions are used to summarize the posterior distribution of the parameters of interest. Convergence refers to the idea that eventually the Gibbs samples will reach a stationary distribution. The questions are:
1) How many samples are needed before convergence? The samples discarded before convergence are called "burn-in".
2) After convergence, how many samples are needed to summarize the posterior distribution?
Answers vary from model to model, and diagnostics are used to examine the samples for convergence.

Possible convergence problems:
- The assumed model may not be realistic from a substantive point of view, or it may not fit the data.
- Errors in calculation or programming.
Often, simple syntax mistakes may be responsible; however, it is possible that the algorithm may not converge to a proper distribution.
- Slow convergence: this is the problem we are most likely to run into. The simulation can remain for many iterations in a region heavily influenced by the starting distribution. If these iterations are used to summarize the target distribution, they can yield falsely precise estimates.

8  Diagnostics for examining convergence
One intuitive and easily implemented diagnostic tool is a trace plot (or history plot), which plots the parameter value as a function of sample number. If the model has converged, the trace plot will move up and down around the mode of the distribution. A clear sign of non-convergence occurs when we observe trending in the trace plot. In WinBUGS, you can set up trace plots to monitor parameters while the program runs.

[Figure: example trace plots illustrating convergence and non-convergence.]

9  An autocorrelation plot will reveal whether or not there is correlation between successive samples. The presence of correlation indicates that the samples are not moving effectively through the entire posterior distribution. WinBUGS plots the autocorrelation out to 50 lags.

[Figure: autocorrelation plots for betaD[1] (no large autocorrelation) and mu.b[1], chains 1:2, out to lag 50.]

10  If the model has converged, additional samples from a parameter's posterior distribution should not influence the calculation of the mean. Running means of the samples will reveal whether the posterior mean has settled to a particular value. A "lumpy" posterior may indicate non-convergence; it may be necessary to let the sampler generate additional samples.

[Figure: posterior density of mu.b[8], chains 1:2, 128 samples.]

Geweke time-series diagnostic: if a model has converged, then the series of samples from the first half of the chain will have the same "time-series" properties as the series of samples from the second half of the chain.
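The autocorrelation and running-mean diagnostics described above are easy to compute directly from a stored chain. A minimal sketch (the function names are my own, not WinBUGS output):

```python
def autocorr(chain, lag):
    """Sample autocorrelation of a chain at a given lag.

    Values near zero at small lags suggest the sampler moves effectively
    through the posterior; large values suggest sticky, slow mixing."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

def running_means(chain):
    """Running mean after each sample; it should settle to a stable value
    if the chain has converged."""
    out, total = [], 0.0
    for i, x in enumerate(chain, start=1):
        total += x
        out.append(total / i)
    return out
```

Plotting `running_means(chain)` against the sample number gives the running-mean diagnostic, and evaluating `autocorr(chain, lag)` for lags 0 to 50 reproduces the kind of autocorrelation summary WinBUGS displays.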
11  Gelman-Rubin diagnostic: Another way to identify convergence is to simulate multiple sequences from different starting points. The behavior of the sequence of chains should be the same. That is, the variance within the chains should be the same as the variance across the chains.

[Figure: trace plots of alpha0, chains 1:2, over 600 iterations, illustrating convergence and non-convergence.]

12  If more than one chain is initialized, the Gelman-Rubin statistic is reported in WinBUGS. The statistic is based on the following procedure:
1) Estimate your model with a variety of different initial values and iterate for an n-iteration burn-in and an n-iteration monitored period.
2) Take the n monitored draws from each of the m chains and calculate the following statistics:

Within-chain variance:  $W = \frac{1}{m(n-1)} \sum_{j=1}^{m} \sum_{i=1}^{n} \left(\theta_j^i - \bar{\theta}_j\right)^2$

Between-chain variance:  $B = \frac{n}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\theta}\right)^2$

Estimated variance:  $\hat{V}(\theta) = \left(1 - \frac{1}{n}\right) W + \frac{1}{n} B$

The Gelman-Rubin statistic:  $R = \sqrt{\hat{V}(\theta) / W}$

Before convergence, W underestimates the total posterior variance in θ because the chains have not fully explored the target distribution. V̂(θ), on the other hand, overestimates the variance in θ because the starting points are overdispersed relative to the target.

13  Once convergence is reached, W and V̂(θ) should be almost equivalent because the variation within the chains and the variation between the chains should coincide, so R should approximately equal one.

[Figure: BGR plot for b[6], chains 1:2, over 500 iterations.]

In the WinBUGS BGR plot, the normalized width of the central 80% interval of the pooled runs is green, the normalized average width of the 80% intervals within the individual runs is blue, and R is red. R would generally be expected to be greater than 1 if the starting values are suitably over-dispersed. Brooks and Gelman (1998) emphasize that one should be concerned both with convergence of R to 1, and with convergence of both the pooled and within-interval widths to stability.
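The quantities W, B, V̂(θ), and R defined above translate directly into code. A minimal sketch, assuming the m chains of n monitored draws for a single parameter are stored as equal-length lists:

```python
def gelman_rubin(chains):
    """Gelman-Rubin statistic R for one parameter from m chains of n draws,
    following the W, B, and V-hat definitions in the text above."""
    m = len(chains)
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # Within-chain variance W: average squared deviation about each chain's mean.
    W = sum(sum((x - cm) ** 2 for x in c)
            for c, cm in zip(chains, chain_means)) / (m * (n - 1))
    # Between-chain variance B: spread of the chain means about the grand mean.
    B = n * sum((cm - grand_mean) ** 2 for cm in chain_means) / (m - 1)
    # Estimated total variance: weighted combination of W and B.
    V_hat = (1 - 1 / n) * W + B / n
    return (V_hat / W) ** 0.5
```

When the chains overlap (B small), R is close to 1; when the chains sit in different regions (B large relative to W), R is well above 1, signaling non-convergence.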
14  Speeding up the sampling
Convergence sometimes requires generating tens or hundreds of thousands of samples. If this is the case, it is important to speed up the sampling.
- Standardize your input variables: subtract the mean and divide by the standard deviation. This decreases the correlation between model variables, which can decrease the autocorrelation between samples.
- Use WinBUGS' over-relaxation algorithm. This generates multiple samples at each iteration and then selects one that is negatively correlated with the current value. The time per iteration will increase, but the within-chain correlations should be reduced, and hence fewer iterations may be necessary. However, this method is not always effective and should be used with caution. The autocorrelation function should be used to check whether the mixing of the chain improves.
- Pick good initial values. If your initial values are near their posterior modes, then convergence should occur relatively quickly, especially if there is not a big problem with autocorrelation.
- Just wait. With models involving many parameters, it may take hundreds of thousands of samples. Using the "thin" option, you do not need to save all the sample values for all the parameters.
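Two of the suggestions above, standardizing covariates and thinning, are simple to implement when preparing data for, or post-processing output from, a sampler. A sketch (helper names are my own):

```python
def standardize(xs):
    """Center and scale a covariate: subtract the mean, divide by the
    sample standard deviation. Helps reduce posterior correlation between
    an intercept and a slope, and hence between-sample autocorrelation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in xs]

def thin(chain, k):
    """Keep every k-th sample, mimicking a sampler's 'thin' option: less
    storage and lower autocorrelation among the retained samples."""
    return chain[::k]
```

For example, `thin(chain, 10)` keeps one sample in ten, so a run of 100,000 monitored iterations is summarized from 10,000 stored draws.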