Modern Computational Statistics
Lecture 20: Applications in Computational Biology
Cheng Zhang, School of Mathematical Sciences, Peking University
December 09, 2019
Introduction 2/23 ◮ While modern statistical approaches have been quite successful in many application areas, there remain challenging domains where complex model structures make those methods difficult to apply. ◮ In this lecture, we discuss some recent advances in statistical approaches for computational biology, with an emphasis on evolutionary models.
Challenges in Computational Biology 3/23 [Figure: overview of challenges in computational biology; adapted from Narges Razavian 2013]
Phylogenetic Inference 4/23
The goal of phylogenetic inference is to reconstruct the evolutionary history (e.g., phylogenetic trees) from molecular sequence data (e.g., DNA, RNA, or protein sequences).

Molecular Sequence Data:
  Taxa        Characters
  Species A   ATGAACAT
  Species B   ATGCACAC
  Species C   ATGCATAT
  Species D   ATGCATGC

[Figure: the corresponding phylogenetic tree relating species A-D]

There are many modern biological and medical applications: predicting the evolution of influenza viruses to help vaccine design, etc.
Example: B Cell Evolution 5/23 This happens inside of you! These inferences guide rational vaccine design.
Bayesian Phylogenetics 6/23
Observed data: aligned molecular sequences Y at the leaves of a tree, e.g. ATGAAC···, ATGCAC···, ATGCAT···, ATGCAT···
[Figure: a phylogenetic tree (τ, q) with nodes y_1, ..., y_6; the observed sequences sit at the leaves, ancestral states are unobserved]
Evolution model: p(ch | pa, q_e), where q_e is the amount of evolution on edge e.
Likelihood (summing over the unobserved ancestral states a^i at each site i = 1, ..., M):

  p(Y | τ, q) = ∏_{i=1}^{M} ∑_{a^i} η(a^i_ρ) ∏_{(u,v) ∈ E(τ)} P_{a^i_u a^i_v}(q_{uv})

where η is the stationary distribution at the root ρ and P(q_{uv}) is the transition probability matrix on edge (u, v).
Given a proper prior distribution p(τ, q), the posterior is

  p(τ, q | Y) ∝ p(Y | τ, q) p(τ, q).
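To make the likelihood concrete, here is a minimal sketch (my own illustration, not from the slides) that evaluates the per-site likelihood by brute-force summation over ancestral states on a tiny four-taxon rooted tree, assuming a Jukes-Cantor substitution model; the function names and the example tree are hypothetical.

```python
import itertools
import numpy as np

STATES = "ACGT"

def jc_transition(t):
    """Jukes-Cantor transition matrix P(t): probability of ending in state j
    given state i after an amount of evolution t (branch length)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def site_likelihood(leaf_states, edges, branch_lengths, internal_nodes):
    """Brute-force sum over ancestral state assignments a:
    sum_a eta(a_root) * prod_{(u,v)} P_{a_u a_v}(q_uv), with eta uniform under JC."""
    P = {e: jc_transition(branch_lengths[e]) for e in edges}
    total = 0.0
    for assignment in itertools.product(range(4), repeat=len(internal_nodes)):
        states = dict(leaf_states)
        states.update(zip(internal_nodes, assignment))
        prob = 0.25  # eta(a_root): uniform stationary distribution under JC
        for (u, v) in edges:
            prob *= P[(u, v)][states[u], states[v]]
        total += prob
    return total

# Toy example: rooted 4-taxon tree ((A,B)u,(C,D)v)r with one aligned site per taxon.
edges = [("r", "u"), ("r", "v"), ("u", "A"), ("u", "B"), ("v", "C"), ("v", "D")]
bl = {e: 0.1 for e in edges}
leaves = {"A": STATES.index("A"), "B": STATES.index("A"),
          "C": STATES.index("C"), "D": STATES.index("C")}
print(site_likelihood(leaves, edges, bl, internal_nodes=["r", "u", "v"]))
# The full-alignment likelihood is the product of such per-site likelihoods over i = 1..M.
```

In practice this sum is computed efficiently with Felsenstein's pruning algorithm rather than by explicit enumeration.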
Markov chain Monte Carlo 7/23 Random-walk MCMC (MrBayes, BEAST): ◮ simple random perturbations (e.g., Nearest Neighbor Interchange, NNI) are used to generate new states. Challenges for MCMC ◮ Large search space: (2n − 5)!! unrooted trees for n taxa (see the snippet below) ◮ Intertwined parameter space and low acceptance rates make it hard to scale to data sets with many sequences.
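As a quick illustration of the search-space challenge (my own addition, not from the slides), the following snippet evaluates (2n − 5)!!, the number of unrooted binary topologies on n taxa:

```python
def num_unrooted_topologies(n):
    """Number of unrooted binary tree topologies on n >= 3 taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # product 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in [5, 10, 20, 50]:
    print(n, num_unrooted_topologies(n))
# Already at n = 50 there are about 2.8e74 topologies, far beyond exhaustive search.
```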
Variational Inference 8/23
[Figure: the variational family Q, the exact posterior p(θ | x), and the closest member q*(θ)]

  q*(θ) = argmin_{q ∈ Q} KL(q(θ) ‖ p(θ | x))

◮ VI turns inference into optimization ◮ Specify a variational family of distributions over the model parameters Q = { q_φ(θ) : φ ∈ Φ } ◮ Fit the variational parameters φ to minimize the distance (often in terms of KL divergence) to the exact posterior
Evidence Lower Bound 9/23

  L(q) = E_{q(θ)}[log p(x, θ)] − E_{q(θ)}[log q(θ)] ≤ log p(x)

◮ The KL divergence is intractable; we maximize the evidence lower bound (ELBO) instead, which only requires the joint probability p(x, θ). ◮ The ELBO is a lower bound on log p(x). ◮ Maximizing the ELBO is equivalent to minimizing the KL. ◮ The ELBO strikes a balance between two terms: ◮ The first term encourages q to place probability mass where the model puts high probability. ◮ The second term encourages q to be diffuse. ◮ As an optimization approach, VI tends to be faster than MCMC and is easier to scale to large data sets (via stochastic gradient ascent).
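A minimal sketch of a Monte Carlo ELBO estimate (my own illustration, not from the lecture), using a conjugate Gaussian toy model where log p(x) is available in closed form so the bound can be checked numerically; the model, observation, and variational family are assumptions for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.5  # a single observed data point

def elbo_estimate(mu, sigma, n_samples=100_000):
    """Monte Carlo estimate of E_q[log p(x, theta)] - E_q[log q(theta)]
    for the model theta ~ N(0, 1), x | theta ~ N(theta, 1),
    with a Gaussian variational family q = N(mu, sigma^2)."""
    theta = rng.normal(mu, sigma, size=n_samples)
    log_joint = norm.logpdf(theta, 0.0, 1.0) + norm.logpdf(x, theta, 1.0)
    log_q = norm.logpdf(theta, mu, sigma)
    return np.mean(log_joint - log_q)

log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))  # exact log p(x) for this model

# Any (mu, sigma) gives ELBO <= log p(x); the exact posterior N(x/2, 1/2) attains it.
print("log p(x)            :", log_evidence)
print("ELBO at a poor q     :", elbo_estimate(0.0, 2.0))
print("ELBO at the optimum q:", elbo_estimate(x / 2, np.sqrt(0.5)))
```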
Subsplit Bayesian Networks 10/23
Inspired by previous works (Höhna and Drummond 2012, Larget 2013), we can decompose trees into local structures and encode the tree topology space via Bayesian networks!

[Figure: example trees on taxa A, B, C, D, their subsplit decompositions (e.g., ABC|D, AB|CD, B|C), and the corresponding subsplit Bayesian network with nodes S_1, ..., S_7]
Probability Estimation Over Tree Topologies 11/23
Rooted Trees:

  p_sbn(T = τ) = p(S_1 = s_1) ∏_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
Unrooted Trees (summing over all root subsplits s_1 compatible with τ, denoted s_1 ∼ τ):

  p_sbn(T^u = τ) = ∑_{s_1 ∼ τ} p(S_1 = s_1) ∏_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
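A minimal sketch of how these two formulas translate into code (my own illustration, not from the slides), assuming the subsplit decomposition of each tree and the SBN probability tables are already available as plain dictionaries; all names and the tree representation are hypothetical.

```python
def rooted_sbn_prob(rooted_tree, root_probs, cond_probs):
    """p_sbn(T = tau) = p(S_1 = s_1) * prod_{i > 1} p(S_i = s_i | S_{pi_i} = s_{pi_i}).

    rooted_tree: a pair (s_1, [(parent_subsplit, child_subsplit), ...]) giving the
                 root subsplit and the parent-child subsplit pairs induced by the tree.
    root_probs:  dict mapping root subsplits s_1 to p(S_1 = s_1).
    cond_probs:  dict mapping (parent, child) pairs to p(S_i = child | S_{pi_i} = parent).
    """
    s1, parent_child_pairs = rooted_tree
    prob = root_probs.get(s1, 0.0)
    for parent, child in parent_child_pairs:
        prob *= cond_probs.get((parent, child), 0.0)
    return prob

def unrooted_sbn_prob(rootings, root_probs, cond_probs):
    """p_sbn(T^u = tau): sum the rooted probability over all rootings s_1 ~ tau,
    where `rootings` lists the rooted decomposition obtained from each compatible root."""
    return sum(rooted_sbn_prob(r, root_probs, cond_probs) for r in rootings)
```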
Tree Probability Estimation via SBNs 12/23
SBNs can be used to learn a probability distribution from a collection of trees T = {T_1, ..., T_K}, with T_k = {S_i = s_{i,k}, i ≥ 1}, k = 1, ..., K.

Rooted Trees ◮ Maximum likelihood estimates: relative frequencies (see the sketch below).

  p̂_MLE(S_1 = s_1) = m_{s_1} / K,    p̂_MLE(S_i = s_i | S_{π_i} = t_i) = m_{s_i, t_i} / ∑_{s ∈ C_i} m_{s, t_i}

Unrooted Trees ◮ Expectation Maximization

  p̂_EM^{(n+1)} = argmax_p ∑_k E_{p̂_EM^{(n)}(S_1 | T_k)} [ log p(S_1) + ∑_{i>1} log p(S_i | S_{π_i}) ]
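For rooted trees the relative-frequency MLE amounts to counting subsplit occurrences; here is a minimal sketch (my own, under the same hypothetical tree representation as above).

```python
from collections import Counter, defaultdict

def sbn_mle(rooted_trees):
    """Relative-frequency estimates from K rooted trees, each given as
    (root_subsplit, [(parent_subsplit, child_subsplit), ...]).

    Returns (root_probs, cond_probs) with
      root_probs[s1]     = m_{s1} / K,
      cond_probs[(t, s)] = m_{s, t} / sum over children s' of t of m_{s', t}.
    """
    K = len(rooted_trees)
    root_counts = Counter()
    pair_counts = Counter()
    parent_totals = defaultdict(int)
    for s1, parent_child_pairs in rooted_trees:
        root_counts[s1] += 1
        for parent, child in parent_child_pairs:
            pair_counts[(parent, child)] += 1
            parent_totals[parent] += 1
    root_probs = {s1: m / K for s1, m in root_counts.items()}
    cond_probs = {(p, c): m / parent_totals[p] for (p, c), m in pair_counts.items()}
    return root_probs, cond_probs
```

For unrooted trees the rooting is unobserved, which is why the EM update above takes an expectation over S_1 given each tree under the current estimate.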