Modern Computational Statistics
Lecture 20: Applications in Computational Biology
Cheng Zhang, School of Mathematical Sciences, Peking University
December 09, 2019
Introduction 2/23 ◮ While modern statistical approaches have been quite successful in many application areas, there remain challenging domains where complex model structures make those methods difficult to apply. ◮ In this lecture, we discuss some recent advances in statistical approaches for computational biology, with an emphasis on evolutionary models.
Challenges in Computational Biology 3/23 [Figure: overview of challenges in computational biology; adapted from Narges Razavian 2013]
Phylogenetic Inference 4/23
The goal of phylogenetic inference is to reconstruct the evolutionary history (e.g., phylogenetic trees) from molecular sequence data (e.g., DNA, RNA, or protein sequences).

Molecular Sequence Data:
  Taxa        Characters
  Species A   ATGAACAT
  Species B   ATGCACAC
  Species C   ATGCATAT
  Species D   ATGCATGC

[Figure: the corresponding phylogenetic tree relating species A-D]

There are many modern biological and medical applications: predicting the evolution of influenza viruses to help vaccine design, etc.
Example: B Cell Evolution 5/23 This happens inside of you! These inferences guide rational vaccine design.
Bayesian Phylogenetics 6/23
Observed data: aligned molecular sequences Y at the leaves of a tree, e.g. ATGAAC···, ATGCAC···, ATGCAT···, ATGCAT···
[Figure: a phylogenetic tree (τ, q) with nodes y_1, ..., y_6; the observed sequences sit at the leaves, ancestral states are unobserved]
Evolution model: p(ch | pa, q_e), where q_e is the amount of evolution on edge e.
Likelihood (summing over the unobserved ancestral states a^i at each site i = 1, ..., M):

  p(Y | τ, q) = ∏_{i=1}^{M} ∑_{a^i} η(a^i_ρ) ∏_{(u,v) ∈ E(τ)} P_{a^i_u a^i_v}(q_{uv})

where η is the stationary distribution at the root ρ and P(q_{uv}) is the transition probability matrix on edge (u, v).
Given a proper prior distribution p(τ, q), the posterior is

  p(τ, q | Y) ∝ p(Y | τ, q) p(τ, q).
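To make the likelihood concrete, here is a minimal sketch (my own illustration, not from the slides) that evaluates the per-site likelihood by brute-force summation over ancestral states on a tiny four-taxon rooted tree, assuming a Jukes-Cantor substitution model; the function names and the example tree are hypothetical.

```python
import itertools
import numpy as np

STATES = "ACGT"

def jc_transition(t):
    """Jukes-Cantor transition matrix P(t): probability of ending in state j
    given state i after an amount of evolution t (branch length)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def site_likelihood(leaf_states, edges, branch_lengths, internal_nodes):
    """Brute-force sum over ancestral state assignments a:
    sum_a eta(a_root) * prod_{(u,v)} P_{a_u a_v}(q_uv), with eta uniform under JC."""
    P = {e: jc_transition(branch_lengths[e]) for e in edges}
    total = 0.0
    for assignment in itertools.product(range(4), repeat=len(internal_nodes)):
        states = dict(leaf_states)
        states.update(zip(internal_nodes, assignment))
        prob = 0.25  # eta(a_root): uniform stationary distribution under JC
        for (u, v) in edges:
            prob *= P[(u, v)][states[u], states[v]]
        total += prob
    return total

# Toy example: rooted 4-taxon tree ((A,B)u,(C,D)v)r with one aligned site per taxon.
edges = [("r", "u"), ("r", "v"), ("u", "A"), ("u", "B"), ("v", "C"), ("v", "D")]
bl = {e: 0.1 for e in edges}
leaves = {"A": STATES.index("A"), "B": STATES.index("A"),
          "C": STATES.index("C"), "D": STATES.index("C")}
print(site_likelihood(leaves, edges, bl, internal_nodes=["r", "u", "v"]))
# The full-alignment likelihood is the product of such per-site likelihoods over i = 1..M.
```

In practice this sum is computed efficiently with Felsenstein's pruning algorithm rather than by explicit enumeration.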
Markov chain Monte Carlo 7/23 Random-walk MCMC (MrBayes, BEAST): ◮ simple random perturbations (e.g., Nearest Neighbor Interchange, NNI) are used to generate new states. Challenges for MCMC ◮ Large search space: (2n − 5)!! unrooted trees for n taxa (see the snippet below) ◮ Intertwined parameter space and low acceptance rates make it hard to scale to data sets with many sequences.
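As a quick illustration of the search-space challenge (my own addition, not from the slides), the following snippet evaluates (2n − 5)!!, the number of unrooted binary topologies on n taxa:

```python
def num_unrooted_topologies(n):
    """Number of unrooted binary tree topologies on n >= 3 taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # product 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in [5, 10, 20, 50]:
    print(n, num_unrooted_topologies(n))
# Already at n = 50 there are about 2.8e74 topologies, far beyond exhaustive search.
```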
Variational Inference 8/23
[Figure: the variational family Q, the exact posterior p(θ | x), and the closest member q*(θ)]

  q*(θ) = argmin_{q ∈ Q} KL(q(θ) ‖ p(θ | x))

◮ VI turns inference into optimization ◮ Specify a variational family of distributions over the model parameters Q = { q_φ(θ) : φ ∈ Φ } ◮ Fit the variational parameters φ to minimize the distance (often in terms of KL divergence) to the exact posterior
Evidence Lower Bound 9/23

  L(q) = E_{q(θ)}[log p(x, θ)] − E_{q(θ)}[log q(θ)] ≤ log p(x)

◮ The KL divergence is intractable; we maximize the evidence lower bound (ELBO) instead, which only requires the joint probability p(x, θ). ◮ The ELBO is a lower bound on log p(x). ◮ Maximizing the ELBO is equivalent to minimizing the KL. ◮ The ELBO strikes a balance between two terms: ◮ The first term encourages q to place probability mass where the model puts high probability. ◮ The second term encourages q to be diffuse. ◮ As an optimization approach, VI tends to be faster than MCMC and is easier to scale to large data sets (via stochastic gradient ascent).
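A minimal sketch of a Monte Carlo ELBO estimate (my own illustration, not from the lecture), using a conjugate Gaussian toy model where log p(x) is available in closed form so the bound can be checked numerically; the model, observation, and variational family are assumptions for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.5  # a single observed data point

def elbo_estimate(mu, sigma, n_samples=100_000):
    """Monte Carlo estimate of E_q[log p(x, theta)] - E_q[log q(theta)]
    for the model theta ~ N(0, 1), x | theta ~ N(theta, 1),
    with a Gaussian variational family q = N(mu, sigma^2)."""
    theta = rng.normal(mu, sigma, size=n_samples)
    log_joint = norm.logpdf(theta, 0.0, 1.0) + norm.logpdf(x, theta, 1.0)
    log_q = norm.logpdf(theta, mu, sigma)
    return np.mean(log_joint - log_q)

log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))  # exact log p(x) for this model

# Any (mu, sigma) gives ELBO <= log p(x); the exact posterior N(x/2, 1/2) attains it.
print("log p(x)            :", log_evidence)
print("ELBO at a poor q     :", elbo_estimate(0.0, 2.0))
print("ELBO at the optimum q:", elbo_estimate(x / 2, np.sqrt(0.5)))
```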
Subsplit Bayesian Networks 10/23
Inspired by previous works (Höhna and Drummond 2012, Larget 2013), we can decompose trees into local structures and encode the tree topology space via Bayesian networks!

[Figure: example trees on taxa A, B, C, D, their subsplit decompositions (e.g., ABC|D, AB|CD, B|C), and the corresponding subsplit Bayesian network with nodes S_1, ..., S_7]
Probability Estimation Over Tree Topologies 11/23
Rooted Trees:

  p_sbn(T = τ) = p(S_1 = s_1) ∏_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
Unrooted Trees (summing over all root subsplits s_1 compatible with τ, denoted s_1 ∼ τ):

  p_sbn(T^u = τ) = ∑_{s_1 ∼ τ} p(S_1 = s_1) ∏_{i>1} p(S_i = s_i | S_{π_i} = s_{π_i})
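A minimal sketch of how these two formulas translate into code (my own illustration, not from the slides), assuming the subsplit decomposition of each tree and the SBN probability tables are already available as plain dictionaries; all names and the tree representation are hypothetical.

```python
def rooted_sbn_prob(rooted_tree, root_probs, cond_probs):
    """p_sbn(T = tau) = p(S_1 = s_1) * prod_{i > 1} p(S_i = s_i | S_{pi_i} = s_{pi_i}).

    rooted_tree: a pair (s_1, [(parent_subsplit, child_subsplit), ...]) giving the
                 root subsplit and the parent-child subsplit pairs induced by the tree.
    root_probs:  dict mapping root subsplits s_1 to p(S_1 = s_1).
    cond_probs:  dict mapping (parent, child) pairs to p(S_i = child | S_{pi_i} = parent).
    """
    s1, parent_child_pairs = rooted_tree
    prob = root_probs.get(s1, 0.0)
    for parent, child in parent_child_pairs:
        prob *= cond_probs.get((parent, child), 0.0)
    return prob

def unrooted_sbn_prob(rootings, root_probs, cond_probs):
    """p_sbn(T^u = tau): sum the rooted probability over all rootings s_1 ~ tau,
    where `rootings` lists the rooted decomposition obtained from each compatible root."""
    return sum(rooted_sbn_prob(r, root_probs, cond_probs) for r in rootings)
```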
Tree Probability Estimation via SBNs 12/23
SBNs can be used to learn a probability distribution from a collection of trees T = {T_1, ..., T_K}, with T_k = {S_i = s_{i,k}, i ≥ 1}, k = 1, ..., K.

Rooted Trees ◮ Maximum likelihood estimates: relative frequencies (see the sketch below).

  p̂_MLE(S_1 = s_1) = m_{s_1} / K,    p̂_MLE(S_i = s_i | S_{π_i} = t_i) = m_{s_i, t_i} / ∑_{s ∈ C_i} m_{s, t_i}

Unrooted Trees ◮ Expectation Maximization

  p̂_EM^{(n+1)} = argmax_p ∑_k E_{p̂_EM^{(n)}(S_1 | T_k)} [ log p(S_1) + ∑_{i>1} log p(S_i | S_{π_i}) ]
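For rooted trees the relative-frequency MLE amounts to counting subsplit occurrences; here is a minimal sketch (my own, under the same hypothetical tree representation as above).

```python
from collections import Counter, defaultdict

def sbn_mle(rooted_trees):
    """Relative-frequency estimates from K rooted trees, each given as
    (root_subsplit, [(parent_subsplit, child_subsplit), ...]).

    Returns (root_probs, cond_probs) with
      root_probs[s1]     = m_{s1} / K,
      cond_probs[(t, s)] = m_{s, t} / sum over children s' of t of m_{s', t}.
    """
    K = len(rooted_trees)
    root_counts = Counter()
    pair_counts = Counter()
    parent_totals = defaultdict(int)
    for s1, parent_child_pairs in rooted_trees:
        root_counts[s1] += 1
        for parent, child in parent_child_pairs:
            pair_counts[(parent, child)] += 1
            parent_totals[parent] += 1
    root_probs = {s1: m / K for s1, m in root_counts.items()}
    cond_probs = {(p, c): m / parent_totals[p] for (p, c), m in pair_counts.items()}
    return root_probs, cond_probs
```

For unrooted trees the rooting is unobserved, which is why the EM update above takes an expectation over S_1 given each tree under the current estimate.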