Accelerating inference for complex stochastic models using Approximate Bayesian Computation, with an application to protein folding
Umberto Picchini (umberto@maths.lth.se), Centre for Mathematical Sciences, Lund University
Joint work with Julie Forman (Dept. of Biostatistics, Copenhagen University)
Dept. of Mathematics, Uppsala, 4 Sept. 2014
Outline. I'll use protein folding data as a motivating example; most of the talk will be about statistical inference issues. I will introduce our data and the protein folding problem. I will introduce a model based on stochastic differential equations. I will mention some theoretical and computational issues related to (Bayesian) inference. I will introduce approximate Bayesian computation (ABC) to alleviate methodological and computational problems.
A motivating example. Proteins are synthesized in the cell on ribosomes as linear, unstructured polymers ...which then self-assemble into specific and functional three-dimensional structures.
This self-assembly process is called protein folding. It's the last and crucial step in the transformation of genetic information, encoded in DNA, into functional protein molecules.
Protein folding is also associated with a wide range of human diseases. In many neurodegenerative diseases, such as Alzheimer's disease, proteins misfold into toxic protein structures. Protein folding has been named “the Holy Grail of biochemistry and biophysics” (!).
Modelling the time dynamics is difficult (large number of atoms in a 3D space); atom coordinates are usually projected onto a single dimension called the reaction coordinate, see the figure below.
Figure: data time-course projected onto a single coordinate: 25,000 measurements of the L-reaction coordinate of the small Trp-zipper protein at sampling frequency ∆⁻¹ = 1/nsec. Here the L-reaction coordinate was used, i.e. the total distance to a folded reference. Notice the random switching between the folded/unfolded states.
Forman and Sørensen [2014]¹ proposed to consider sums of diffusions:

Z_t (observable process) = Y_t (latent state) + U_t (autocorrelated error term)

They considered diffusion processes to model both Y_t and U_t; they found that i.i.d. errors were not really giving satisfactory results. So let's introduce some autocorrelation:

dU_t = −κ U_t dt + √(2κγ²) dW_t,   U_0 = 0,

a zero-mean Ornstein-Uhlenbeck process with stationary variance γ² and autocorrelation ρ_U(t) = e^(−κt). Here dW_t ∼ N(0, dt).
¹ Forman and Sørensen, A transformation approach to modelling multi-modal diffusions. J. Statistical Planning and Inference, 2014.
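As an aside, here is a minimal sketch (not the authors' code) of how one could simulate the error term U_t in Python, assuming a regular sampling grid and using the exact Gaussian transition of the OU process; the parameter values and the lag used in the sanity check are purely illustrative.

```python
import numpy as np

def simulate_ou_error(kappa, gamma, n, dt, rng):
    """Simulate U_t from dU_t = -kappa*U_t dt + sqrt(2*kappa*gamma^2) dW_t on a
    regular grid with spacing dt, using the exact Gaussian transition
    (in discrete time this is an AR(1) process)."""
    a = np.exp(-kappa * dt)               # lag-one autocorrelation, e^{-kappa*dt}
    sd = gamma * np.sqrt(1.0 - a**2)      # conditional standard deviation
    u = np.empty(n)
    u[0] = gamma * rng.standard_normal()  # start in the stationary N(0, gamma^2) law
    for i in range(1, n):
        u[i] = a * u[i - 1] + sd * rng.standard_normal()
    return u

# quick sanity check with illustrative parameter values (not the fitted ones)
rng = np.random.default_rng(1)
u = simulate_ou_error(kappa=0.5, gamma=1.2, n=200_000, dt=1.0, rng=rng)
print(u.var())                                             # approx gamma^2 = 1.44
print(np.corrcoef(u[:-5], u[5:])[0, 1], np.exp(-0.5 * 5))  # approx e^{-kappa*5}
```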
Regarding the “signal” part Y_t: the data clearly show a bimodal marginal structure (see the time-course and the marginal histogram in the figure). So we want a stochastic process that is able to switch between the two “modes” (i.e. the folded/unfolded states). One possible option: an OU process X_t with zero mean and unit variance,

dX_t = −θ X_t dt + √(2θ) dB_t,   X_0 = x_0,

then plug each X_t into the cdf of its stationary N(0,1) distribution, i.e. take Φ(X_t), with Φ the cdf of N(0,1).
Now build a Gaussian mixture and take the percentile corresponding to the area Φ(X_t). To summarize: simulate X_t ⇒ compute Φ(X_t) ⇒ find the percentile Y_t from a 2-component Gaussian mixture with cdf

F(y) = α · Φ((y − µ1)/σ1) + (1 − α) · Φ((y − µ2)/σ2)
So in conclusion we have Z_t (data) = Y_t (latent state) + U_t (autocorrelated error term):

Z_t = Y_t + U_t,   Y_t := τ(X_t)
dX_t = −θ X_t dt + √(2θ) dB_t
dU_t = −κ U_t dt + √(2κγ²) dW_t
τ(x) = (F⁻¹ ∘ Φ)(x),   F(y) = α · Φ((y − µ1)/σ1) + (1 − α) · Φ((y − µ2)/σ2)

We are interested in conducting (Bayesian) inference for η = (θ, κ, γ, α, µ1, µ2, σ1, σ2).
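For concreteness, below is a hedged sketch of a forward simulator for the full model Z_t = τ(X_t) + U_t in Python. It assumes a regular grid, uses exact OU transitions for X_t and U_t, and inverts the mixture cdf F numerically (it has no closed-form inverse). Function names and parameter values are illustrative and not taken from the abc-sde toolbox.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mixture_cdf(y, alpha, mu1, sigma1, mu2, sigma2):
    """cdf F(y) of the two-component Gaussian mixture."""
    return (alpha * norm.cdf((y - mu1) / sigma1)
            + (1.0 - alpha) * norm.cdf((y - mu2) / sigma2))

def mixture_quantile(p, alpha, mu1, sigma1, mu2, sigma2):
    """Invert F numerically (no closed form): solve F(y) = p by root bracketing."""
    lo = min(mu1, mu2) - 10.0 * max(sigma1, sigma2)
    hi = max(mu1, mu2) + 10.0 * max(sigma1, sigma2)
    return brentq(lambda y: mixture_cdf(y, alpha, mu1, sigma1, mu2, sigma2) - p, lo, hi)

def simulate_z(eta, n, dt, rng):
    """Forward-simulate Z_t = tau(X_t) + U_t on a regular grid of n points."""
    theta, kappa, gamma, alpha, mu1, mu2, sigma1, sigma2 = eta
    ax, au = np.exp(-theta * dt), np.exp(-kappa * dt)   # exact OU transition coefficients
    x, u = np.empty(n), np.empty(n)
    x[0], u[0] = rng.standard_normal(), 0.0             # X stationary N(0,1); U_0 = 0
    for i in range(1, n):
        x[i] = ax * x[i - 1] + np.sqrt(1.0 - ax**2) * rng.standard_normal()
        u[i] = au * u[i - 1] + gamma * np.sqrt(1.0 - au**2) * rng.standard_normal()
    p = norm.cdf(x)                                      # Phi(X_t)
    y = np.array([mixture_quantile(pi, alpha, mu1, sigma1, mu2, sigma2) for pi in p])
    return y + u                                         # Z_t = Y_t + U_t

# illustrative call (parameter values are made up, not estimates)
z_sim = simulate_z((0.003, 0.5, 1.0, 0.5, 25.5, 31.0, 0.5, 0.6),
                   n=800, dt=30.0, rng=np.random.default_rng(0))
```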
Difficulties with exact Bayesian inference. A non-exhaustive list of difficulties in using exact Bayesian inference via MCMC and SMC in our application: our dataset is “large” (25,000 observations), which is not “terribly” large in an absolute sense, but it is when dealing with diffusion processes. In fact, even when a proposed parameter value is in the bulk of the posterior distribution, generated trajectories might still be too distant from the data (⇒ high rejection rate!). A high rejection rate implies poor exploration of the posterior surface, poor inferential results and increased computational time. Some of these issues can be mitigated using bridging techniques (Beskos et al. '13), but this is not trivial in our case (the transformation τ(x) is unknown in closed form).
Before going into approximate methods... you should trust (!) that we have put a lot of effort into trying to avoid approximations! Still... Besides the theoretical difficulties, currently existing methods do not scale well for large data: e.g. we attempted to use particle MCMC (Andrieu et al. '10) and even with a few (only 10!) particles it would require weeks to obtain results. It is also expensive to simulate from our model, as percentiles of mixture models are unknown in closed form. We had to use some approximate strategy, and we considered ABC (approximate Bayesian computation).
Notation. Data: z = (z_0, z_1, ..., z_n). Unknown parameters: η = (θ, κ, γ, α, µ1, µ2, σ1, σ2). Likelihood function: p(z|η). Prior density: π(η) (our a priori knowledge of η). Posterior density: π(η|z) ∝ π(η) p(z|η). Ideally we would like to use/sample from the posterior. We assume this is either theoretically difficult or computationally expensive (it is in our case!).
Approximate Bayesian computation (ABC). ABC gives a way to approximate a posterior distribution π(η|z). Key to the success of ABC is the ability to bypass the explicit calculation of the likelihood p(z|η): only forward simulation from the model is required! ABC is in fact a likelihood-free method that works by simulating pseudo-data zsim from the model: zsim ∼ p(z|η). It has had incredible success in genetics studies since the mid '90s (Tavaré et al. '97, Pritchard et al. '99). Lots of hype in recent years: see Christian Robert's excellent blog.
Basic rejection sampler (NO approximations here!)

for r = 1 to R do
  repeat
    Generate a parameter η′ from its prior distribution π(η)
    Generate zsim from the likelihood p(z|η′) (!! no need to know p(·) analytically !!)
  until zsim = z (simulated data = actual data)
  set η_r = η′
end for

The algorithm produces R samples from the exact posterior π(η|z). :( It won't work for continuous data or large amounts of data because Pr(z = zsim) ≈ 0 ⇒ substitute z = zsim with z ≈ zsim.
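As a toy illustration (not our protein model), here is the exact-matching rejection sampler for a discrete example where Pr(zsim = z) > 0, namely a single Binomial observation with a Beta prior; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, z_obs, R = 20, 13, 2000      # hypothetical Binomial data: 13 successes out of 20

samples = []
while len(samples) < R:
    p = rng.beta(1.0, 1.0)             # draw the parameter from a Beta(1,1) prior
    z_sim = rng.binomial(n_trials, p)  # forward-simulate from the likelihood
    if z_sim == z_obs:                 # keep only exact matches, zsim = z
        samples.append(p)

# 'samples' holds exact posterior draws; here the posterior is Beta(1 + 13, 1 + 7)
```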
...⇒ substitute z = zsim with z ≈ zsim. Introduce some distance ‖z − zsim‖ to measure the proximity of zsim to the data z [Pritchard et al. 1999]. Introduce a tolerance value δ > 0.

An ABC rejection sampler:
for r = 1 to R do
  repeat
    Generate a parameter η′ from its prior distribution π(η)
    Generate zsim from the likelihood p(z|η′)
  until ‖z − zsim‖ < δ [or alternatively ‖S(z) − S(zsim)‖ < δ]
  set η_r = η′
end for

for some “summary statistics” S(·).
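A generic sketch of the tolerance-based ABC rejection sampler, followed by a toy Gaussian-mean example; the prior, simulator, summary and δ are placeholders chosen purely for illustration.

```python
import numpy as np

def abc_rejection(z_obs, prior_sampler, simulate, summary, delta, R, rng=None):
    """Generic ABC rejection sampler: keep eta whenever ||S(z_sim) - S(z_obs)|| < delta."""
    rng = np.random.default_rng() if rng is None else rng
    s_obs = summary(z_obs)
    accepted = []
    while len(accepted) < R:
        eta = prior_sampler(rng)               # eta ~ prior
        z_sim = simulate(eta, rng)             # z_sim ~ p(z | eta): forward simulation only
        if np.linalg.norm(summary(z_sim) - s_obs) < delta:
            accepted.append(eta)
    return np.array(accepted)

# toy usage: infer the mean mu of i.i.d. N(mu, 1) data with S(z) = sample mean
rng = np.random.default_rng(2)
z = rng.normal(1.5, 1.0, size=100)
draws = abc_rejection(z,
                      prior_sampler=lambda g: g.normal(0.0, 5.0),
                      simulate=lambda mu, g: g.normal(mu, 1.0, size=100),
                      summary=lambda x: np.array([x.mean()]),
                      delta=0.1, R=200)
```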
The previous algorithm samples from the approximate posteriors π(η | ‖z − zsim‖ < δ) or π(η | ‖S(z) − S(zsim)‖ < δ). It is useful to consider statistics S(·) when dealing with large datasets, to increase the probability of acceptance.
The key result of ABC: when S(·) is “sufficient” for η and δ ≈ 0, sampling from the posterior is (almost) exact! When S(·) is sufficient for the parameter ⇒ π(η | ‖S(zsim) − S(z)‖ < δ) ≡ π(η | ‖zsim − z‖ < δ); when δ = 0 and S is sufficient, we accept only parameter draws for which zsim ≡ z ⇒ π(η|z), the exact posterior. This is all good and nice, but such conditions rarely hold.
A central problem is how to choose the statistics S(·): outside the exponential family we typically cannot derive sufficient statistics. [A key work for obtaining statistics “semi-automatically” is Fearnhead-Prangle '12 (discussion paper in JRSS-B, very much recommended).] We substitute sufficiency with the loose concept of an informative (enough) statistic, and then choose a small (enough) threshold δ. We now go back to our model and (large) data. We propose some tricks to accelerate the inference. We will use an ABC-within-MCMC approach (ABC-MCMC).
ABC-MCMC. Example: weigh the discrepancy between observed data and simulated trajectories using a uniform 0-1 kernel K_δ(·), e.g.

K_δ(S(zsim), S(z)) = 1 if ‖S(zsim) − S(z)‖ < δ, and 0 otherwise.

There is complete freedom to choose a different criterion... Use such a measure in place of the typical conditional density of the data given the latent states. See next slide.
Recall the model: Z_t = Y_t + U_t with Y_t := τ(X_t), dX_t = −θ X_t dt + √(2θ) dB_t, dU_t = −κ U_t dt + √(2κγ²) dW_t.

ABC-MCMC acceptance ratio: given the current value of the parameter, η ≡ ηold, generate a Markov chain via Metropolis-Hastings.

Algorithm 1: a generic iteration of ABC-MCMC (fixed threshold δ). At the r-th iteration:
1. generate η′ ∼ u(η′|ηold), e.g. using a Gaussian random walk
2. generate zsim|η′ ∼ “from the model” (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept η′ if ω < min{1, [π(η′) K_δ(S(zsim), S(z)) u(ηold|η′)] / [π(ηold) K_δ(S(zold), S(z)) u(η′|ηold)]}; then set η_r = η′, else η_r = ηold

Samples are from π(η, zsim | ‖S(zsim) − S(z)‖ < δ).
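A sketch of Algorithm 1 in Python, assuming a symmetric Gaussian random-walk proposal (so the u(·|·) terms cancel), a 0-1 kernel, and a chain started at an admissible η (so the kernel in the denominator equals 1); simulate, summary and log_prior are user-supplied placeholders, not abc-sde functions.

```python
import numpy as np

def abc_mcmc(z_obs, simulate, summary, log_prior, eta0, delta, R, step, rng=None):
    """Sketch of Algorithm 1: ABC-MCMC with a fixed threshold delta, a 0-1 kernel and
    a symmetric Gaussian random-walk proposal. Assumes eta0 is admissible, i.e. the
    kernel in the denominator of the acceptance ratio equals 1."""
    rng = np.random.default_rng() if rng is None else rng
    s_obs = summary(z_obs)
    eta = np.asarray(eta0, dtype=float)
    chain = np.empty((R, eta.size))
    for r in range(R):
        eta_prop = eta + step * rng.standard_normal(eta.size)   # step 1: propose eta'
        z_sim = simulate(eta_prop, rng)                          # step 2: forward simulation
        log_omega = np.log(rng.uniform())                        # step 3: omega ~ U(0,1)
        close = np.linalg.norm(summary(z_sim) - s_obs) < delta   # K = 1 iff summaries are delta-close
        log_ratio = log_prior(eta_prop) - log_prior(eta)
        if close and log_omega < min(0.0, log_ratio):            # step 4: accept/reject
            eta = eta_prop
        chain[r] = eta
    return chain
```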
Algorithm 2: a generic iteration of ABC-MCMC (random threshold δ). At the r-th iteration:
1. generate η′ ∼ u(η′|ηold) and δ′ ∼ v(δ′|δold)
2. generate zsim|η′ ∼ “from the model” (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept η′ if ω < min{1, [π(η′) K_δ′(S(zsim), S(z)) u(ηold|η′) v(δold|δ′)] / [π(ηold) K_δold(S(zold), S(z)) u(η′|ηold) v(δ′|δold)]}; then set (η_r, δ_r) = (η′, δ′), else (η_r, δ_r) = (ηold, δold)

Samples are from π(η, δ, zsim | ‖S(zsim) − S(z)‖ < δ).
By using a (not too!) small threshold δ we might obtain a decent acceptance rate for the approximate posterior... However, ABC does not save us from having to produce computationally costly “long” trajectories zsim (n ≈ 25,000) at each step of ABC-MCMC. But in ABC the relevant info about our simulations is encoded into S(·)... do we really have to simulate a zsim having the same length as our data z?? Idea: simulate “short” zsim that are still qualitatively representative! (see next slide...)
Figure: top row, the full dataset of 28,000 observations (time-course and marginal histogram); bottom row, every 30th observation (time-course and marginal histogram). The thinned dataset is 30 times smaller but the qualitative features are still there!
Strategy for large datasets (Picchini-Forman '14). We have a dataset z of about 25,000 observations. Prior to starting ABC-MCMC, construct S(z) to contain:
1. the 15th, 30th, ..., 90th percentiles of the marginal distribution of the full data → to identify the Gaussian-mixture parameters µ1, µ2, etc.
2. values of the autocorrelation function of the full data z at lags (60, 300, 600, ..., 2100) → to identify the dynamics-related parameters θ, γ, κ.
During ABC-MCMC we simulate shorter trajectories zsim of size 25,000/30 ≈ 800. We take as summary statistics S(zsim) the 15th, 30th, ..., 90th percentiles of the simulated data and autocorrelations at lags (2, 10, 20, ..., 70) (recall zsim is 30x shorter than z). We then compare S(zsim) with S(z) within ABC-MCMC. This is fast! S(·) for the large data can be computed prior to the ABC-MCMC start.
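Below is a sketch of how the two sets of summaries could be computed; the intermediate percentiles and lags are filled in assuming regular steps consistent with the 30x scaling quoted above, so they are assumptions rather than values taken from the paper.

```python
import numpy as np

def acf(z, lags):
    """Sample autocorrelation of z at the requested lags."""
    zc = np.asarray(z, dtype=float) - np.mean(z)
    denom = np.dot(zc, zc)
    return np.array([np.dot(zc[:-k], zc[k:]) / denom for k in lags])

PCTL = np.arange(15, 91, 15)                               # 15th, 30th, ..., 90th percentiles (assumed step of 15)
LAGS_DATA = (60, 300, 600, 900, 1200, 1500, 1800, 2100)    # assumed lags for the full data
LAGS_SIM = (2, 10, 20, 30, 40, 50, 60, 70)                 # corresponding 30x shorter lags

def summary_data(z):
    """S(z) for the full dataset: marginal percentiles + ACF at long lags (computed once)."""
    return np.concatenate([np.percentile(z, PCTL), acf(z, LAGS_DATA)])

def summary_sim(z_sim):
    """S(zsim) for the 30x shorter simulated trajectory: same percentiles, shorter lags."""
    return np.concatenate([np.percentile(z_sim, PCTL), acf(z_sim, LAGS_SIM)])
```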
So the first strategy to accelerate computations was to simulate a smaller amount of artificial data. Our second strategy is to perform so-called “early rejection” of proposed parameters.
ABC-MCMC acceptance ratio: accept η′ if

ω < min{1, [π(η′) K(‖S(zsim) − S(z)‖) u(ηold|η′)] / [π(ηold) K(‖S(zold) − S(z)‖) u(η′|ηold)]}.

However, remember that K(·) is a 0/1 kernel → let's start the algorithm at an admissible ηstart such that at the first iteration S(zold) ≈ S(z) (so the kernel in the denominator equals 1) → accept η′ if

ω < min{1, [π(η′)/π(ηold)] · K(‖S(zsim) − S(z)‖) · [u(ηold|η′)/u(η′|ηold)]}.

Notice that η′ will surely be rejected if

ω > [π(η′)/π(ηold)] · [u(ηold|η′)/u(η′|ηold)]

REGARDLESS of the value of K(·) ∈ {0, 1} → do NOT simulate trajectories if the above is satisfied!
Algorithm 3: Early-Rejection ABC-MCMC (Picchini '13)
1. At the (r + 1)-th ABC-MCMC iteration:
2. generate η′ ∼ u(η|η_r) from its proposal distribution;
3. generate ω ∼ U(0, 1);
   if ω > [π(η′) u(η_r|η′)] / [π(η_r) u(η′|η_r)] (= “ratio”) then
     (η_{r+1}, S(zsim,r+1)) := (η_r, S(zsim,r))   (proposal early-rejected)
   else
     generate xsim ∼ π(x|η′) conditionally on the η′ from step 2;
     determine ysim = τ(xsim), generate zsim ∼ π(z|ysim, η′) and calculate S(zsim);
     if K(‖S(zsim) − S(z)‖) = 0 then
       (η_{r+1}, S(zsim,r+1)) := (η_r, S(zsim,r))   (proposal rejected)
     else if ω ≤ ratio then
       (η_{r+1}, S(zsim,r+1)) := (η′, S(zsim))   (proposal accepted)
     else
       (η_{r+1}, S(zsim,r+1)) := (η_r, S(zsim,r))   (proposal rejected)
     end if
   end if
4. increment r to r + 1. If r > R stop, else go to step 2.
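A Python sketch of the early-rejection scheme above, again with a symmetric random-walk proposal and a 0-1 kernel; the point is simply that ω is drawn before the forward simulation, which is skipped whenever rejection is already certain. The helper functions are placeholders.

```python
import numpy as np

def early_rejection_abc_mcmc(z_obs, simulate, summary, log_prior, eta0, delta, R, step, rng=None):
    """Early-rejection ABC-MCMC sketch: omega is drawn before the forward simulation,
    which is skipped whenever the proposal would be rejected regardless of K(.)."""
    rng = np.random.default_rng() if rng is None else rng
    s_obs = summary(z_obs)
    eta = np.asarray(eta0, dtype=float)
    chain = np.empty((R, eta.size))
    for r in range(R):
        eta_prop = eta + step * rng.standard_normal(eta.size)   # symmetric random-walk proposal
        log_omega = np.log(rng.uniform())
        log_ratio = log_prior(eta_prop) - log_prior(eta)         # proposal densities cancel
        if log_omega > log_ratio:
            chain[r] = eta                                       # early rejection: no simulation at all
            continue
        z_sim = simulate(eta_prop, rng)                          # simulate only if acceptance is still possible
        if np.linalg.norm(summary(z_sim) - s_obs) < delta:       # K = 1: accept (omega <= ratio already holds)
            eta = eta_prop
        chain[r] = eta
    return chain
```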
Notice that “early rejection” works only with 0-1 kernels. No reason not to use it. Early rejection per se is not an approximation, it's just a trick. It saved us 40-50% of the computing time (Picchini '13).
Results after 2 million ABC-MCMC iterations. Acceptance rate of 1% and 6 hrs of computation with MATLAB on a common desktop PC.

Table: protein folding data experiment: posterior means from the ABC-MCMC output and 95% posterior intervals.
  log θ    –6.454   [–6.898, –5.909]
  log κ    –0.651   [–1.424, 0.246]
  log γ     0.071   [–0.313, 0.378]
  log µ1    3.24    [3.22, 3.26]
  log µ2    3.43    [3.39, 3.45]
  log σ1   –0.959   [–2.45, 0.38]
  log σ2   –0.424   [–2.26, 0.76]
  log α    –0.663   [–1.035, –0.383]
Figure: data (top), process Ŷ_t (middle), process Ẑ_t (bottom), where Ẑ_t = Ŷ_t + Û_t, meaning “evaluated at η̂ = posterior mean”.
A simulation study (Picchini-Forman '14). Here we want to compare ABC against (computationally intensive) exact Bayesian inference (via particle MCMC, pMCMC). In order to do so we consider a very small dataset of 360 simulated observations. We use a parallel strategy for pMCMC devised in Drovandi '14 (4 chains run in parallel, using 100 particles for each chain).
C. Drovandi (2014). Pseudo-marginal algorithms with multiple CPUs. Queensland University of Technology, http://eprints.qut.edu.au/61505/
U.P. and Forman (2014). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
Comparison: ABC-MCMC (*) vs exact Bayes (pMCMC). For each parameter: true value, pMCMC posterior mean [95% interval], ABC-MCMC posterior mean* [95% interval].
  θ     0.0027   0.0023 [0.0013, 0.0041]   0.0024* [0.0013, 0.0039]
  κ     0.538    0.444  [0.349, 0.558]     0.553*  [0.386, 0.843]
  γ     1.063    1.040  [0.943, 1.158]     0.982*  [0.701, 1.209]
  µ1    25.52    25.68  [25.08, 26.61]     25.72*  [25.10, 26.71]
  µ2    30.92    32.12  [29.15, 35.42]     32.17*  [29.46, 34.96]
  σ1    0.540    0.421  [0.203, 0.844]     0.523*  [0.248, 0.972]
  σ2    0.624    0.502  [0.232, 1.086]     0.511*  [0.249, 1.041]
  α     0.537    0.510  [0.345, 0.755]     0.508*  [0.346, 0.721]
Figure: marginal posteriors for (a) log θ, (b) log κ, (c) log γ. Exact Bayesian (solid); ABC-MCMC (dashed); true value (vertical lines); uniform priors.
Figure: marginal posteriors for (a) log µ1, (b) log µ2, (c) log σ1, (d) log σ2, (e) log α. Exact Bayesian (solid); ABC-MCMC (dashed); true value (vertical lines); uniform priors.
Conclusions. As long as we manage to “compress” information into summary statistics, ABC is a useful inferential tool for complex models and large datasets. 1,000 ABC-MCMC iterations are performed in 6 sec, versus about 20 min with exact Bayesian sampling (pMCMC). The problem is that ABC requires lots of tuning (choosing S(·), δ, K(·), ...). A MATLAB implementation is available at http://sourceforge.net/projects/abc-sde/ with a 50+ page manual.
References
U.P. (2014). Inference for SDE models via Approximate Bayesian Computation. J. Comput. Graph. Statist.
U.P. and J. Forman (2013). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
U.P. (2013). abc-sde: a MATLAB toolbox for approximate Bayesian computation (ABC) in stochastic differential equation models. http://sourceforge.net/projects/abc-sde/
Appendix
Proof that the basic ABC algorithm works. The proof is straightforward. We know that a draw (η′, zsim) produced by the algorithm is such that (i) η′ ∼ π(η), and (ii) zsim = z, where zsim ∼ π(zsim | η′). Thus let's call f(η′) the (unknown) density of such an η′; then because of (i) and (ii)

f(η′) ∝ Σ_{zsim} π(η′) π(zsim|η′) I_z(zsim) = π(η′, zsim)|_{zsim = z} ∝ π(η′|z).

Therefore η′ ∼ π(η|z).
A theoretical motivation to consider ABC.
An important (known) result: a fundamental consequence is that if S(·) is a sufficient statistic for θ then lim_{δ→0} π_δ(θ | y) = π(θ | y), the exact (marginal) posterior!!! Uh?! Otherwise (in general) the algorithm draws from the approximation π(θ | ρ(S(x), S(y)) < δ).
Also, by introducing the class of quadratic losses L(θ0, θ̂; A) = (θ0 − θ̂)ᵀ A (θ0 − θ̂):
Another relevant result: if S(y) = E(θ | y) then the minimal expected quadratic loss E(L(θ0, θ̂; A) | y) is achieved via θ̂ = E_ABC(θ | S(y)) as δ → 0.
The straightforward motivation is the following: consider the (ABC) posterior π_δ(θ | y); then

π_δ(θ | y) = ∫ π_δ(θ, x | y) dx ∝ π(θ) ∫ (1/δ) K(|S(x) − S(y)|/δ) π(x | θ) dx → π(θ) π(S(x) = S(y) | θ)   as δ → 0.

Therefore if S(·) is a sufficient statistic for θ then lim_{δ→0} π_δ(θ | y) = π(θ | y), the exact (marginal) posterior!!!
Acceptance probability in Metropolis-Hastings. Suppose at a given iteration of Metropolis-Hastings we are in the (augmented) state (θ#, x#) and wonder whether to move (or not) to a new state (θ′, x′). The move is generated via a proposal distribution q((θ#, x#) → (θ′, x′)), e.g. q((θ#, x#) → (θ′, x′)) = u(θ′|θ#) v(x′ | θ′); the move (θ#, x#) → (θ′, x′) is accepted with probability

α = min{1, [π(θ′) π(x′|θ′) π(y|x′, θ′) q((θ′, x′) → (θ#, x#))] / [π(θ#) π(x#|θ#) π(y|x#, θ#) q((θ#, x#) → (θ′, x′))]}
  = min{1, [π(θ′) π(x′|θ′) π(y|x′, θ′) u(θ#|θ′) v(x# | θ#)] / [π(θ#) π(x#|θ#) π(y|x#, θ#) u(θ′|θ#) v(x′ | θ′)]}.

Now choose v(x | θ) ≡ π(x | θ):

α = min{1, [π(θ′) π(x′|θ′) π(y|x′, θ′) u(θ#|θ′) π(x# | θ#)] / [π(θ#) π(x#|θ#) π(y|x#, θ#) u(θ′|θ#) π(x′ | θ′)]},

and the π(x′|θ′) and π(x#|θ#) factors cancel. This is likelihood-free! And we only need to know how to generate x′.
Generation of the δ's. Figure: trace of the generated δ values over the ABC-MCMC iterations. Here we generate a chain for log δ using a (truncated) Gaussian random walk with support (−∞, log δmax]. We let log δmax decrease during the simulation.
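A possible sketch of such a proposal mechanism for log δ (an assumption about the implementation, not the authors' code); if a truncated proposal like this is used inside the MCMC, the corresponding v(δ′|δ) terms enter the acceptance ratio as in Algorithm 2, since the truncation makes the proposal asymmetric.

```python
import numpy as np

def propose_log_delta(log_delta, log_delta_max, nu, rng):
    """One truncated Gaussian random-walk move for log(delta) on (-inf, log_delta_max]:
    redraw until the proposed value falls inside the support."""
    while True:
        prop = log_delta + nu * rng.standard_normal()
        if prop <= log_delta_max:
            return prop
```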
HOWTO: post-hoc selection of δ (the “precision” parameter) [Bortot et al. 2007]. During ABC-MCMC we let δ vary (according to a random walk, MRW): at the r-th iteration δ_r = δ_{r−1} + ∆, with ∆ ∼ N(0, ν²). After the end of the MCMC we have a sequence {θ_r, δ_r}, r = 0, 1, 2, ..., and for each parameter {θ_{j,r}}, r = 0, 1, 2, ..., we produce a plot of the parameter chain vs δ (the bandwidth on the horizontal axis).
Post-hoc selection of the bandwidth δ, cont'd. Therefore, in practice: we filter out of the analyses those draws {θ_r} corresponding to “large” δ, for statistical precision; we retain only those {θ_r} corresponding to a low δ. In the example we retain {θ_r : δ_r ≤ 1.5}.
PRO: this is useful as it allows an ex-post selection of δ, i.e. we do not need to know a suitable value for δ in advance.
CON: by filtering out some of the draws, the approach requires very long MCMC runs in order to have enough “material” on which to base our posterior inference.
PRO: also notice that by letting δ vary we are almost considering a global optimization method (similar to simulated tempering).
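In code, the post-hoc filtering amounts to a boolean selection on the stored (θ_r, δ_r) pairs; the arrays below are placeholders standing in for the actual ABC-MCMC output.

```python
import numpy as np

# placeholders standing in for the stored ABC-MCMC output:
# eta_chain has one row per iteration, delta_chain the matching bandwidth values
rng = np.random.default_rng(0)
eta_chain = rng.standard_normal((10_000, 8))
delta_chain = rng.uniform(0.0, 4.0, size=10_000)

keep = delta_chain <= 1.5            # retain only draws obtained with a small bandwidth
eta_retained = eta_chain[keep]
posterior_means = eta_retained.mean(axis=0)
```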