
Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction

By Ritambhara Singh, IIIT-Delhi. June 10, 2016.


  1. Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction, by Ritambhara Singh, IIIT-Delhi, June 10, 2016

  2. Biology in a Slide. [Diagram labels: organism, cell, DNA, RNA, protein.]

  3. DNA and Diseases: Down Syndrome, Parkinson’s Disease, Autism, Muscular Atrophy, Sickle Cell Disease, ... [Shown alongside the organism, cell, DNA, RNA, protein diagram.]

  4. Transcription Factors. [Diagram labels: Transcription Factor, Transcription Factor Binding Site, Gene; example DNA sequence ATCGCGTAGCTAGGGATGACAGACACACATAATTCTAGATA...; shown alongside the organism, cell, DNA, RNA, protein diagram.]

  5. ChIP-seq Maps TF Binding. [Diagram labels: Transcription Factor, Transcription Factor Binding Site, Gene, DNA; ChIP-seq yields a peak map for the TF along the genome. Example sequences: ATATCGTATCTAAACCGCCTCGG..., ATATCGTATCTTTTAAACCGGGTTGGCCACTAGA...]

  6. TF Binding Differs Across Contexts. [Example sequences: ATATCGTATCTTTTAAACCGGGTATGTAATGCAT..., ATATCGTATCTAAACCGCCCGTGT..., ATATCGTATCTTTTAAACCGGGTTGGCCAGTATA..., ATATCGTATCTAAACCGCCCTGCA...]

  7. Current Challenge: ENCODE Data Gap. [Figure: ChIP-seq data matrix with missing entries marked "?" across cell contexts: Blood Cell, Stem Cell, Leukemia, Lung Cancer, Immunity related, Nerve Cell, Cervical Cancer.] Source: http://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html

  8. Case for Computational Tools

  9. Existing Computational Tools. Generative approaches: MEME, CISFINDER. Discriminative approaches: String Kernel + SVM.

  10. Generative: PWM-Based Approach. [Figure: ChIP-seq peak map for a TF on the genome (example sequences ATATCGTATAACAATAACCGGGAACTAATAGC..., ATATCGTATCTAACAAATCCTACT...), the resulting Position Weight Matrix, and its sequence logo.]

          Position   1   2   3   4   5   6   7   8   9  10  11  12
          A         14   0   0  14  28  40   9  45  42  13  15   9
          T         12   3   4  12  11  10   9   6   5  38  12   3
          C          3   0   1   8   2   2  36   2   2   0   1   0
          G          0   1   0  16  10   1   2   3   2   0   7  11
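
The PWM idea on this slide can be illustrated with a small sketch. It is not the procedure used by MEME or CISFINDER; it only shows one common way to turn per-position base counts into log-odds scores and scan a sequence for the best-matching window. The aligned `sites`, the pseudocount, and the uniform background of 0.25 are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT"

def count_matrix(sites):
    """Count how often each base occurs at each position of aligned, equal-length sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in ALPHABET}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2(frequency / background); the pseudocount avoids log(0)."""
    length = len(counts["A"])
    pwm = {base: [0.0] * length for base in ALPHABET}
    for pos in range(length):
        total = sum(counts[b][pos] for b in ALPHABET) + pseudocount * len(ALPHABET)
        for b in ALPHABET:
            freq = (counts[b][pos] + pseudocount) / total
            pwm[b][pos] = math.log2(freq / background)
    return pwm

def best_window(pwm, sequence):
    """Slide the matrix along the sequence and return (best score, start index)."""
    width = len(pwm["A"])
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scored.append((sum(pwm[base][i] for i, base in enumerate(window)), start))
    return max(scored)

# Hypothetical aligned binding sites, for illustration only.
sites = ["TAACAA", "TAACAA", "TATCAA", "AAACAA"]
pwm = log_odds_matrix(count_matrix(sites))
score, start = best_window(pwm, "GGCGTAACAAGGCG")
```

Here `start` points at the window that best matches the motif; a generative tool would report such high-scoring windows as candidate binding sites.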

  11. Generative Approach: Output. [Figure: sample MEME output for genome sequences ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC... and ATATCGTATCTAAACCGCCCTACT..., with the bound TFs marked "?".] Source: http://www.cbil.upenn.edu/EpoDB/release/version_2.2/meme/meme-output.html#sample

  12. Generative Approach: Limitations. The output is a long list of potential TFs; the approach works well only for well-preserved motifs or large training datasets; PWMs are not available for all ~2000 TFs; prediction performance is lower than that of discriminative approaches.

  13. Discriminative Approach: Output. [Figure: genome sequences ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC... and ATATCGTATCTAAACCGCCCTACT... are each labeled as peak (+1) or genome background (-1).]

  14. Discriminative: String Kernel Approach with a Support Vector Machine

  15. Discriminative Approach: Limitation. Assumption: training and test data follow the same distribution regardless of context.

  16. Aim: Improve prediction of transcription factor binding sites across contexts using knowledge transfer.

  17. Proposed Solution: Cross-Context Knowledge Transfer

  18. Transfer String Kernel: Overview. [Pipeline: sequences from the source context (Xs) and the target context (Xt), with examples ATCGAT, ATACAT, GTATAC, GCTTAC → feature conversion for each context → knowledge transfer (KMM) → training → classification.]

  19. Outline. Method: String Kernel; Support Vector Machine; Transfer Learning (KMM); Importance Re-weighting; Transfer String Kernel. Evaluation: Experimental Setup; Cross-context TFBS prediction; Cross-context Protein Binding prediction.

  21. String Kernel: Spectrum Kernel. The feature map is indexed by all k-length subsequences ("k-mers") over an alphabet Σ. For amino acids, |Σ| = 20; for the DNA sequences used in the experiments, |Σ| = 4.
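
As a minimal sketch of the spectrum kernel described above: each sequence is mapped to its k-mer counts, and the kernel value is the dot product of those count vectors. The function names are illustrative, and the two input strings are borrowed from the overview slide purely as sample data.

```python
from collections import Counter

def spectrum_features(sequence, k):
    """Map a sequence to the counts of every k-mer it contains."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(x, y, k):
    """Dot product of the two k-mer count vectors (only shared k-mers contribute)."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    return sum(count * fy[kmer] for kmer, count in fx.items())

# "AT" occurs twice in each sequence and is the only shared 2-mer.
value = spectrum_kernel("ATCGAT", "ATACAT", 2)
```

Storing only the k-mers that actually occur keeps the representation sparse, which matters because the full feature space has |Σ|^k dimensions.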

  22. String Kernel: Mismatch Kernel. For a k-mer s, the mismatch neighborhood N_(k,m)(s) is the set of all k-mers t within m mismatches of s.
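
The mismatch neighborhood can be sketched by brute-force enumeration, assuming a DNA alphabet and m no larger than k. Each k-mer in a sequence then contributes one count to every k-mer in its neighborhood, and the kernel is again a dot product. This is a naive illustration; practical implementations use far more efficient data structures.

```python
from collections import Counter
from itertools import combinations, product

def mismatch_neighborhood(kmer, m, alphabet="ACGT"):
    """All k-mers that differ from `kmer` in at most m positions (assumes m <= len(kmer))."""
    neighbors = set()
    # Choose m positions and try every letter there; re-using the original
    # letter at a chosen position covers the cases with fewer than m mismatches.
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            candidate = list(kmer)
            for pos, letter in zip(positions, letters):
                candidate[pos] = letter
            neighbors.add("".join(candidate))
    return neighbors

def mismatch_features(sequence, k, m, alphabet="ACGT"):
    """Each k-mer in the sequence adds one count to every k-mer in its neighborhood."""
    features = Counter()
    for i in range(len(sequence) - k + 1):
        features.update(mismatch_neighborhood(sequence[i:i + k], m, alphabet))
    return features

def mismatch_kernel(x, y, k, m, alphabet="ACGT"):
    """Dot product of the two mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

With m = 0 this reduces to the spectrum kernel. With m = 1, "ACG" and "ACT" share the four neighbors ACA, ACC, ACG and ACT, so they are no longer orthogonal.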

  24. Support Vector Machine. Decision boundary: w · x + b = 0. Negative instances (y = -1) satisfy w · x + b ≤ -1; positive instances (y = +1) satisfy w · x + b ≥ +1.
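
The two inequalities on this slide can be merged into a single constraint, which gives the standard hard-margin SVM problem. The slide itself shows only the three hyperplanes, so the optimization form below is the textbook statement rather than something taken from the talk; the index i over training examples is notation introduced here.

```latex
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \text{for all } i
```

The distance between the hyperplanes w · x + b = +1 and w · x + b = -1 is 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin.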

  26. Transfer Learning (KMM). [Figure: true densities p_tr(x) and p_te(x), and their ratios r(x).]
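
Kernel Mean Matching estimates a weight for each training (source) point so that the weighted source sample resembles the test (target) sample in kernel feature space; these weights approximate the density ratio, assuming r(x) denotes p_te(x) / p_tr(x) as is usual for KMM. The sketch below is a simplified version: it keeps only the box constraint 0 ≤ β ≤ B and drops the usual constraint on the sum of the weights, it uses plain projected gradient descent instead of a QP solver, and it uses a Gaussian kernel on 1-D points where the talk would use a string kernel. All names and values are assumptions for illustration.

```python
import math

def kmm_weights(source, target, kernel, upper_bound=10.0, iterations=2000):
    """Approximately minimize 0.5 * b'Kb - kappa'b subject to 0 <= b_i <= upper_bound."""
    n_s, n_t = len(source), len(target)
    K = [[kernel(a, b) for b in source] for a in source]
    kappa = [n_s / n_t * sum(kernel(a, t) for t in target) for a in source]
    # 1 / trace(K) is a safe step size: trace(K) bounds the largest eigenvalue of K.
    step = 1.0 / sum(K[i][i] for i in range(n_s))
    beta = [1.0] * n_s
    for _ in range(iterations):
        grad = [sum(K[i][j] * beta[j] for j in range(n_s)) - kappa[i]
                for i in range(n_s)]
        # Gradient step followed by projection back onto the box [0, upper_bound].
        beta = [min(upper_bound, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta

def rbf(a, b, gamma=1.0):
    """Gaussian kernel on scalars; a stand-in for a string kernel."""
    return math.exp(-gamma * (a - b) ** 2)

# Hypothetical 1-D data: the target sample is shifted to the right of the source.
source = [0.0, 1.0, 2.0, 3.0]
target = [2.0, 2.5, 3.0, 3.5]
weights = kmm_weights(source, target, rbf)
```

Source points that lie where the target sample is dense (here 2.0 and 3.0) receive larger weights than points far from it (0.0 and 1.0).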

  28. Importance Re-weighting. [Figure: original weights compared with KMM weights.]
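
One common way to use importance weights in an SVM is to scale each training example's hinge loss by its weight, so that source examples resembling the target context count more. Whether the talk uses exactly this formulation is an assumption; the sketch below only evaluates such a weighted objective for a linear model, with hypothetical data and weights.

```python
def weighted_svm_objective(w, b, X, y, beta, C=1.0):
    """0.5 * ||w||^2 + C * sum_i beta_i * hinge(y_i * (w . x_i + b))."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    weighted_loss = 0.0
    for xi, yi, bi in zip(X, y, beta):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        weighted_loss += bi * max(0.0, 1.0 - margin)
    return regularizer + C * weighted_loss

# Hypothetical 1-D training set; the second example is misclassified by w = [1.0].
X = [[2.0], [0.5], [-2.0]]
y = [1, -1, -1]
uniform = weighted_svm_objective([1.0], 0.0, X, y, beta=[1.0, 1.0, 1.0])
reweighted = weighted_svm_objective([1.0], 0.0, X, y, beta=[1.0, 0.2, 1.0])
```

Down-weighting the misclassified example from 1.0 to 0.2 lowers its contribution to the objective, so the trained boundary would be influenced less by it.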

  30. Transfer String Kernel (TSK). [Pipeline: source-context and target-context sequences (examples shown: ATCGATCG, CCCGATCG, CTCGCTCC) → feature conversion with the mismatch string kernel → knowledge transfer via Kernel Mean Matching (KMM) → importance re-weighting → SVM training → classification.]

  32. Experimental Setup: 14 transcription factors (ENCODE ChIP-seq); top 1000 positive sequences (500 for training, 500 for testing); 1000 random negative sequences; hyper-parameter tuning over k = (8, 10, 12) and m = (1, 2, 3); dictionary size = 4 {A, T, C, G}.

  34. Results

  35. Results: Cross Context. [Bar chart: AUC score (axis from 0.8 to 0.9) of TSK vs. SK for the transcription factors Sin3a, Max, Mxi1, Chd2 and Ctcf.]
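
The chart reports AUC scores. As a reminder of what that metric measures, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher classifier score, counting ties as half. The labels and scores are made up for illustration and are unrelated to the reported results.

```python
def auc_score(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == -1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (+1 = bound, -1 = not bound) and classifier scores.
example_auc = auc_score([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.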

  37. Results: Cross Context

  38. Summary. TSK improves cross-context TFBS predictions overall; string-kernel-based approaches perform better than state-of-the-art Position Weight/Frequency Matrix based TFBS tools; the TSK approach generalizes to improving performance on any cross-context sequence prediction task. Presented at BIOKDD ’15.

  39. Acknowledgements: Dr. Mazhar Adli (Adli Lab, Department of Biochemistry and Molecular Genetics, UVa); Nipun Batra (IIIT-Delhi).

  40. Machine Learning Lab @ UVa: Dr. Yanjun Qi (Advisor), Beilun Wang, Weilin Xu, Jack Lanchantin, Ji Gao.

  41. Future Directions. Deep learning: gene expression prediction using histone modification data (ECCB 2016); improving TFBS prediction using DNA sequences (ICLR Workshop 2016, ICML Workshop 2016). String kernels: improving efficiency (ongoing work).

  42. Thank You
