conditioning by adaptive sampling for robust design

Conditioning by adaptive sampling for robust design David Brookes - PowerPoint PPT Presentation


  1. Conditioning by adaptive sampling for robust design. David Brookes (Biophysics Graduate Group, University of California, Berkeley) and Jennifer Listgarten (EECS and Center for Computational Biology, University of California, Berkeley)

  2. Motivating problem: design protein sequences • Proteins are made up of sequences of amino acids (20 possibilities) • Huge variety of proteins whose function we would like to improve

  3. Motivating problem: design protein sequences • Proteins are made up of sequences of amino acids (20 possibilities) • Huge variety of proteins whose function we would like to improve Proteins that fluoresce

  4. Motivating problem: design protein sequences • Proteins are made up of sequences of amino acids (20 possibilities) • Huge variety of proteins whose function we would like to improve Proteins that fluoresce … that act as drugs

  5. Motivating problem: design protein sequences • Proteins are made up of sequences of amino acids (20 possibilities) • Huge variety of proteins whose function we would like to improve Proteins that fluoresce … that fixate carbon in the atmosphere … that act as drugs

  6. Motivating problem: design protein sequences • Proteins are made up of sequences of amino acids (20 possibilities) • Huge variety of proteins whose function we would like to improve Proteins that fluoresce … that fixate carbon in the atmosphere … that deliver gene-editing tools to tissues … that act as drugs

  7. How to map sequence to function? A law of molecular biology: Sequence → Structure → Function (ex: fluorescence) Hughes A, Mort M, Carlisle F, et al. B04 Alternative Splicing In Htt. Journal of Neurology, Neurosurgery & Psychiatry 2014; 85: A10. http://www.rcsb.org/structure/6FWW

  8. Bypassing the structure relationships A law of molecular biology: Sequence → Structure → Function. High-throughput experiments (& ML) Hughes A, Mort M, Carlisle F, et al. B04 Alternative Splicing In Htt. Journal of Neurology, Neurosurgery & Psychiatry 2014; 85: A10. http://www.rcsb.org/structure/6FWW

  9. Can we solve the inverse problem? A law of molecular biology: Sequence → Structure → Function. Design problem: Given a model, find sequences with desired function Hughes A, Mort M, Carlisle F, et al. B04 Alternative Splicing In Htt. Journal of Neurology, Neurosurgery & Psychiatry 2014; 85: A10. http://www.rcsb.org/structure/6FWW

  10. Why is protein design difficult? • Huge, rugged search space ⟹ size scales as $20^L$ for sequences of length $L$ (figure: compared with atoms in the universe and grains of sand on Earth)

  11. Why is protein design difficult? • Huge, rugged search space ⟹ size scales as $20^L$ for sequences of length $L$ (figure: compared with atoms in the universe and grains of sand on Earth) • Discrete search space (no gradients)

  12. Why is protein design difficult? • Huge, rugged search space ⟹ size scales as $20^L$ for sequences of length $L$ (figure: compared with atoms in the universe and grains of sand on Earth) • Discrete search space (no gradients) • Uncertainty in predictor https://livingthing.danmackinlay.name/gaussian_processes.html

  13. Possible solution: model-based optimization (MBO) Idea: replace the standard (hard) objective over, e.g., the space of sequences

  14. Possible solution: model-based optimization (MBO) Idea: replace the standard (hard) objective over, e.g., the space of sequences with a potentially easier one, defined through a model over sequence space

  15. Possible solution: model-based optimization (MBO) Idea: replace the standard (hard) objective with a potentially easier one. Solution approach is to iterate: 1. Sample from “search model” $p(x \mid \theta)$ 2. Evaluate samples on $f(x)$ 3. Adjust $\theta$ so the model favors samples with large function evals

  16. Possible solution: model-based optimization (MBO) Idea: replace the standard (hard) objective with a potentially easier one. Solution approach is to iterate: 1. Sample from “search model” $p(x \mid \theta)$ 2. Evaluate samples on $f(x)$ 3. Adjust $\theta$ so the model favors sequences with large function evals. ✓ Model can sample broad areas of sequence space ✓ Does not require gradients of $f$ ✓ Can incorporate uncertainty
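
The three-step loop on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the search model is a hypothetical independent-site categorical distribution (the slides mention VAEs and HMMs instead), step 3 is a simple "refit on the top-scoring fraction" rule, and `sample`, `refit`, `mbo`, and the toy objective `f` are names invented for this example.

```python
import numpy as np

def sample(theta, n, rng):
    """Draw n sequences from an independent-site categorical model.
    theta has shape (L, A): per-position probabilities over A letters."""
    L, A = theta.shape
    return np.stack([rng.choice(A, size=n, p=theta[j]) for j in range(L)], axis=1)

def refit(seqs, weights, A, pseudocount=1e-3):
    """Weighted maximum-likelihood update of theta from sampled sequences."""
    n, L = seqs.shape
    theta = np.full((L, A), pseudocount)
    for j in range(L):
        # accumulate the weight of every sample that has each letter at site j
        np.add.at(theta[j], seqs[:, j], weights)
    return theta / theta.sum(axis=1, keepdims=True)

def mbo(f, L, A, iters=30, n=200, top_frac=0.2, seed=0):
    """Generic model-based optimization loop (assumed selection rule)."""
    rng = np.random.default_rng(seed)
    theta = np.full((L, A), 1.0 / A)          # start from the uniform model
    for _ in range(iters):
        x = sample(theta, n, rng)             # 1. sample from the search model
        y = np.array([f(s) for s in x])       # 2. evaluate samples on f
        cutoff = np.quantile(y, 1 - top_frac)
        w = (y >= cutoff).astype(float)       # 3. favor high-scoring samples
        theta = refit(x, w, A)
    return theta

# Toy objective (hypothetical): number of positions matching a hidden target.
target = np.array([0, 1, 2, 3, 0, 1])
f = lambda s: float(np.sum(s == target))

theta = mbo(f, L=6, A=4)
best = theta.argmax(axis=1)   # most probable letter at each position
```

Note that the loop never differentiates `f`; it only needs function values at sampled sequences, which is the "does not require gradients" point on the slide.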

  17. First attempt at MBO for protein design: Design by Adaptive Sampling (DbAS) Our aim is to solve the MBO objective: [equation]

  18. First attempt at MBO for protein design: Design by Adaptive Sampling (DbAS) Our aim is to solve the MBO objective: [equation] where • $p(x \mid \theta)$ is the search model (VAE, HMM…)

  19. First attempt at MBO for protein design: Design by Adaptive Sampling (DbAS) Our aim is to solve the MBO objective: [equation] where • $p(x \mid \theta)$ is the search model (VAE, HMM…) • $S$ is the desired set of property values → e.g. fluorescence > 𝛽

  20. First attempt at MBO for protein design: Design by Adaptive Sampling (DbAS) Our aim is to solve the MBO objective: [equation] where • $p(x \mid \theta)$ is the search model (VAE, HMM…) • $S$ is the desired set of property values → e.g. fluorescence > 𝛽 • $P(S \mid x)$ is a stochastic predictive model (“oracle”) that maps sequences to property
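
The objective itself did not survive in this transcript. One form consistent with the three definitions on the slide (a search model over sequences, a desired set of property values, and an oracle giving the probability of landing in that set) is shown below. Treat it as a reconstruction under that assumption rather than the slide's exact equation; the symbols $x$ (sequence), $\theta$ (search-model parameters), and $S$ (desired set) are the notation assumed here.

```latex
\hat{\theta}
  \;=\; \arg\max_{\theta} \, \log P(S \mid \theta)
  \;=\; \arg\max_{\theta} \, \log \mathbb{E}_{p(x \mid \theta)}\!\left[ P(S \mid x) \right]
```

Written this way, the parameters appear inside the distribution the expectation is taken over, which is the first issue raised on the next slide.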

  21. Design by Adaptive Sampling (cont.) Two issues: 1. $\theta$ is in the expectation distribution.

  22. Design by Adaptive Sampling (cont.) Two issues: 1. $\theta$ is in the expectation distribution → maximize a lower bound [equation]

  23. Design by Adaptive Sampling (cont.) Two issues: 1. $\theta$ is in the expectation distribution → maximize a lower bound [equation] 2. MC estimates for rare events.

  24. Design by Adaptive Sampling (cont.) Two issues: 1. $\theta$ is in the expectation distribution → maximize a lower bound [equation] 2. MC estimates for rare events → anneal a sequence of relaxations: $S^{(t)} \to S$, where $S^{(t)} \supset S^{(t+1)}$

  25. Design by Adaptive Sampling (cont.) Two issues: 1. $\theta$ is in the expectation distribution → maximize a lower bound [equation] 2. MC estimates for rare events → anneal and MC [equation]

  26. Design by Adaptive Sampling (cont.) Two issues: 1. $\theta$ is in the expectation distribution → maximize a lower bound [equation] 2. MC estimates for rare events → anneal and MC [equation] Assumes oracle is unbiased and has good uncertainty estimates.
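
The "anneal and MC" step can be illustrated with a small sketch. It assumes, for the sake of the example, that the desired set is "property above a threshold" and that the oracle returns a Gaussian predictive mean and standard deviation per sequence; the quantile-based relaxation schedule and the names `prob_exceeds`, `next_threshold`, and `dbas_weights` are hypothetical choices, not taken from the slides.

```python
import math
import numpy as np

def prob_exceeds(mu, sigma, gamma):
    """P(y > gamma | x) when the oracle predicts y ~ Normal(mu, sigma^2)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = (gamma - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 - np.vectorize(math.erf)(z))

def next_threshold(mu, gamma_prev, gamma_target, q=0.8):
    """Relaxed threshold for the current iteration: it never decreases
    (so the relaxed sets are nested) and never passes the final target."""
    return min(max(gamma_prev, float(np.quantile(mu, q))), gamma_target)

def dbas_weights(mu, sigma, gamma_prev, gamma_target, q=0.8):
    """Monte Carlo weights for samples drawn from the current search model.
    Each weight is the oracle's probability that the sample is in the relaxed set."""
    gamma_t = next_threshold(mu, gamma_prev, gamma_target, q)
    return prob_exceeds(mu, sigma, gamma_t), gamma_t

# Toy oracle outputs for five sampled sequences (made-up numbers).
mu = np.array([0.1, 0.5, 0.9, 1.4, 2.0])
sigma = np.full(5, 0.3)
w, gamma_t = dbas_weights(mu, sigma, gamma_prev=-np.inf, gamma_target=3.0)
```

The weights would then be used in a weighted maximum-likelihood refit of the search model. Because the relaxed threshold tracks a quantile of the current samples, a useful fraction of samples always carries non-negligible weight, which is how annealing sidesteps Monte Carlo estimates of a rare event.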

  27. How pathological oracles lead you astray

  28. How pathological oracles lead you astray Acceptable: many training examples

  29. How pathological oracles lead you astray Acceptable: many training examples. Pathological: fewer training examples

  30. How pathological oracles lead you astray Acceptable vs. pathological. Idea: estimate training distribution of x conditioned on high values of oracle

  31. Fixing pathological oracles w/ conditioning Idea: estimate training distribution of x conditioned on high values of oracle

  32. Fixing pathological oracles w/ conditioning Idea: estimate training distribution of x conditioned on high values of oracle. Don’t have access to training distribution, but can build a model $p(x \mid \theta^{(0)})$ to approximate it

  33. Conditioning by Adaptive Sampling (CbAS) Previous formulation: [equations] Anneal and MC. New formulation: [equations] $p(x \mid \theta^{(0)})$ models the training distribution

  34. Conditioning by Adaptive Sampling (CbAS) Previous formulation: [equations] Anneal and MC. New formulation: [equations]

  35. Conditioning by Adaptive Sampling (CbAS) Previous formulation: [equations] Anneal and MC. New formulation: [equations] Can’t anneal when sampling dist. doesn’t change!

  36. Conditioning by Adaptive Sampling (CbAS) Previous formulation: [equations] Anneal and MC. New formulation: [equations] Importance sampling proposal dist.

  37. Conditioning by Adaptive Sampling (CbAS) Previous formulation: [equations] Anneal and MC. New formulation: [equations] Anneal and MC.
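
The importance-sampling step named on these slides can be sketched as follows. The idea assumed here is that samples are drawn from the current search model but should count as if they came from the fixed model of the training distribution, so each sample's oracle weight is multiplied by a density ratio. The independent-site categorical model and the names `log_prob` and `cbas_weights` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def log_prob(seqs, theta):
    """log-probability of each sequence under an independent-site categorical
    model. seqs: (n, L) integer array; theta: (L, A) per-site probabilities."""
    L = theta.shape[0]
    return np.log(theta[np.arange(L), seqs]).sum(axis=1)

def cbas_weights(seqs, theta0, theta_t, p_s_given_x):
    """Weights for samples drawn from the current search model (theta_t):
    density ratio (training-distribution model / current model) times the
    oracle's probability that the sample has the desired property."""
    log_ratio = log_prob(seqs, theta0) - log_prob(seqs, theta_t)
    return np.exp(log_ratio) * p_s_given_x

# Toy example (made-up numbers): two sites, two letters.
theta0 = np.array([[0.5, 0.5], [0.5, 0.5]])    # model of the training distribution
theta_t = np.array([[0.9, 0.1], [0.5, 0.5]])   # current search model
seqs = np.array([[0, 0], [1, 1]])
w = cbas_weights(seqs, theta0, theta_t, p_s_given_x=np.array([1.0, 1.0]))
```

When the search model drifts far from the training-distribution model, the ratio shrinks for sequences the training data rarely covered, which damps the influence of oracle predictions in regions where the oracle is least trustworthy. Working in log space before exponentiating keeps the ratio numerically stable for long sequences.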

  38. Testing is fundamentally different • We don’t trust our oracle and generally can’t query the ground truth

  39. Testing is fundamentally different • We don’t trust our oracle and generally can’t query the ground truth • We can’t hold out a test set of good sequences • Near-zero chance of any of these sequences being found by the method (figure label: Test set)

  40. Testing is fundamentally different • We don’t trust our oracle and generally can’t query the ground truth • We can’t hold out a test set of good sequences • Near-zero chance of any of these sequences being found by the method • We can’t use some canonical test function as the oracle • In our problem it is untrustworthy

  41. Testing strategy • Simulate a ground truth based on real data → “Ground truth” is a GP mean function (figure label: Ground truth GP)

  42. Testing strategy • Simulate a ground truth based on real data → “Ground truth” is a GP mean function • Ground truth values are sampled from the GP for given sequences • Use these input-output pairs to train oracles (figure labels: Ground truth GP, Training data, Oracles)

  43. Testing strategy • Simulate a ground truth based on real data → “Ground truth” is a GP mean function • Ground truth values are sampled from the GP for given sequences • Use these input-output pairs to train oracles • Coerce training set so these oracles exhibit pathologies
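
A rough sketch of this testing pipeline, under several assumptions that are not from the slides: sequences are stood in for by random real-valued feature vectors, the GP uses a unit-variance squared-exponential kernel, values are sampled from the GP's pointwise marginals rather than jointly, and the "coercion" keeps only the lower half of sampled values so that an oracle would see no examples near the optimum. All names and numbers are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between rows of a and rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=0.1):
    """Posterior mean and variance of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # prior variance is 1 for this kernel; subtract the explained part
    var = 1.0 - np.einsum("ij,ji->i", k_qt, np.linalg.solve(k_tt, k_qt.T))
    return mean, np.maximum(var, 0.0)

rng = np.random.default_rng(0)
x_real = rng.normal(size=(30, 4))        # stand-in for encoded real sequences
y_real = np.sin(x_real.sum(axis=1))      # stand-in for measured property values

# "Ground truth" is the GP mean function fitted to the real data.
x_new = rng.normal(size=(200, 4))        # sequences used to build oracle training data
mean, var = gp_posterior(x_real, y_real, x_new)

# Sample ground-truth values from the GP (pointwise marginals plus noise).
y_sim = mean + np.sqrt(var + 0.1**2) * rng.normal(size=len(x_new))

# Coerce the training set (assumed rule): keep only the lower-valued half.
cutoff = np.quantile(y_sim, 0.5)
keep = y_sim < cutoff
x_oracle_train, y_oracle_train = x_new[keep], y_sim[keep]
```

Because the ground truth is itself a known function here, designed sequences can be scored against it directly, which is what makes evaluation possible when a held-out test set or a trusted oracle is unavailable.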

  44. Results
