Transfer String Kernel for Cross-Context Sequence Specific DNA-Protein Binding Prediction by Ritambhara Singh IIIT-Delhi June 10, 2016 1
Biology in a Slide CELL PROTEIN RNA DNA ORGANISM 2
DNA and Diseases • Down Syndrome • Parkinson’s Disease • Autism • Muscular Atrophy • Sickle Cell Disease • … [Figure labels: ORGANISM, CELL, DNA, RNA, PROTEIN] 3
Transcription Factors [Figure: a transcription factor bound at its transcription factor binding site on the DNA sequence ATCGCGTAGCTAGGGATGACAGACACACATAATTCTAGATA, next to a gene; labels: ORGANISM, CELL, DNA, RNA, PROTEIN] 4
ChIP-seq Maps TF Binding [Figure: a transcription factor bound at its binding site on DNA near a gene; ChIP-seq produces a peak map for the TF along the genome. Example sequences: ATATCGTATCTAAACCGCCTCGG, ATATCGTATCTTTTAAACCGGGTTGGCCACTAGA] 5
TF Binding Differs Across Contexts [Figure: example sequences ATATCGTATCTTTTAAACCGGGTATGTAATGCAT, ATATCGTATCTAAACCGCCCGTGT, ATATCGTATCTTTTAAACCGGGTTGGCCAGTATA, ATATCGTATCTAAACCGCCCTGCA] 6
Current Challenge: ENCODE Data Gap [Figure: ENCODE ChIP-seq data matrix with missing entries marked "?"; cell types shown: Blood Cell, Stem Cell, Leukemia, Lung Cancer, Immunity related, Nerve Cell, Cervical Cancer] Source : http://genome.ucsc.edu/ENCODE/dataMatrix/encodeChipMatrixHuman.html 7
Case for Computational Tools 8
Existing Computational Tools • Generative Approaches: MEME, CISFINDER • Discriminative Approaches: STRING KERNEL+SVM 9
Generative : PWM Based Approach
[Figure: ChIP-seq peak map for a TF along the genome; example sequences ATATCGTATAACAATAACCGGGAACTAATAGC and ATATCGTATCTAACAAATCCTACT]

Position Weight Matrix
Position   1   2   3   4   5   6   7   8   9  10  11  12
A         14   0   0  14  28  40   9  45  42  13  15   9
T         12   3   4  12  11  10   9   6   5  38  12   3
C          3   0   1   8   2   2  36   2   2   0   1   0
G          0   1   0  16  10   1   2   3   2   0   7  11

[Figure: sequence logo] 10
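To make the PWM idea concrete, here is a minimal Python sketch that turns the counts from the slide above into log-odds scores and scans a sequence for its best-scoring window. The pseudocount of 1, the uniform background of 0.25, and the function names `log_odds_pwm` and `best_window` are assumptions made for illustration; the slide does not say how tools such as MEME normalize or score.

```python
import math

# Counts copied from the slide's position weight matrix (12 positions).
COUNTS = {
    "A": [14, 0, 0, 14, 28, 40, 9, 45, 42, 13, 15, 9],
    "T": [12, 3, 4, 12, 11, 10, 9, 6, 5, 38, 12, 3],
    "C": [3, 0, 1, 8, 2, 2, 36, 2, 2, 0, 1, 0],
    "G": [0, 1, 0, 16, 10, 1, 2, 3, 2, 0, 7, 11],
}
# Example sequence taken from the same slide.
SEQ = "ATATCGTATAACAATAACCGGGAACTAATAGC"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position counts to log2-odds scores.

    Assumptions (not from the slide): add-one pseudocount, uniform background.
    Each column is normalized by its own total, since the printed columns
    do not all sum to the same number.
    """
    width = len(next(iter(counts.values())))
    pwm = []
    for pos in range(width):
        total = sum(counts[b][pos] for b in counts) + pseudocount * len(counts)
        pwm.append({
            b: math.log2((counts[b][pos] + pseudocount) / total / background)
            for b in counts
        })
    return pwm


def best_window(sequence, pwm):
    """Slide the PWM along the sequence; return (best score, start index)."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


pwm = log_odds_pwm(COUNTS)
score, start = best_window(SEQ, pwm)
print(f"best window starts at index {start}: {SEQ[start:start + len(pwm)]}")
```

A scan like this returns a score for every window, which is one reason generative tools tend to produce long candidate lists rather than a single yes/no call.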
Generative Approach : Output [Figure: genome sequences ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC and ATATCGTATCTAAACCGCCCTACT, marked "?"] Source : http://www.cbil.upenn.edu/EpoDB/release/version_2.2/meme/meme-output.html#sample 11
Generative Approach: Limitations – Output: long list of potential TFs – Works well only for well-preserved motifs or large training datasets – PWMs are not available for all ~2000 TFs – Lower prediction performance than discriminative approaches 12
Discriminative Approach : Output [Figure: genome sequences ATATCGTATCTTTTAAACCGGGTTGGCCAATAGC and ATATCGTATCTAAACCGCCCTACT, marked "?"; each sequence is classified as Peak or Genome, with labels +1 and -1] 13
Discriminative : String Kernel Approach Support Vector Machine 14
Discriminative Approach : Limitation Assumption: Training/test data follow same distribution regardless of context. 15
Aim • Improve prediction of transcription factor binding sites (TFBS) across contexts using knowledge transfer. 16
Proposed Solution : Cross-Context Knowledge Transfer 17
Transfer String Kernel : Overview [Figure: source-context sequences Xs and target-context sequences Xt (examples shown: ATCGAT, ATACAT, GTATAC, GCTTAC) each go through Feature Conversion; Knowledge Transfer (KMM) connects the two contexts, followed by Training and Classification] 18
Outline • Method – String Kernel – Support Vector Machine – Transfer Learning (KMM) – Importance re-weighting – Transfer String Kernel • Evaluation – Experimental Setup – Cross-context TFBS prediction – Cross-context Protein Binding prediction 19
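String Kernel : Spectrum Kernel Feature map indexed by all contiguous k-length subsequences (“k-mers”) over an alphabet Σ. For amino acids, as in the original protein setting, |Σ| = 20; for the DNA sequences used here, Σ = {A, T, C, G} and |Σ| = 4. 21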
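The spectrum kernel can be sketched in a few lines of Python: count every contiguous k-mer in each sequence and take the inner product of the two count vectors. The function names `spectrum_features` and `spectrum_kernel` are illustrative, not from the talk, and the two example sequences are borrowed from the overview slide.

```python
from collections import Counter


def spectrum_features(sequence, k):
    """Sparse k-mer count vector: one entry per contiguous k-mer that occurs."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    fx = spectrum_features(x, k)
    fy = spectrum_features(y, k)
    # Counter returns 0 for k-mers that never occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


# With k = 2 the only shared 2-mer is "AT", which occurs twice in each sequence.
value = spectrum_kernel("ATCGAT", "ATACAT", k=2)
print(value)
```

Storing only the k-mers that actually occur keeps the representation sparse, which matters because the full feature space has |Σ|^k dimensions.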
String Kernel : Mismatch Kernel For a k-mer s, the mismatch neighborhood $N_{(k,m)}(s)$ is the set of all k-mers t that differ from s in at most m positions. 22
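A brute-force Python sketch of the mismatch feature map follows: each k-mer s in the sequence adds one count to every k-mer t in its mismatch neighborhood. Enumerating all |Σ|^k candidate k-mers is only practical for small k and is used here purely for clarity; it is not the efficient algorithm a real implementation would need, and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def mismatch_features(sequence, k, m, alphabet="ATCG"):
    """Each k-mer s in the sequence adds 1 to every t with hamming(s, t) <= m."""
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    features = Counter()
    for i in range(len(sequence) - k + 1):
        s = sequence[i:i + k]
        for t in all_kmers:
            if hamming(s, t) <= m:
                features[t] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet="ATCG"):
    """Inner product of the two mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[t] for t, count in fx.items())


# A single 3-mer with m = 1 has 1 + 3 * 3 = 10 neighbors over a 4-letter alphabet.
neighborhood = mismatch_features("ATC", k=3, m=1)
print(len(neighborhood))
```

Setting m = 0 reduces the neighborhood to s itself, so the mismatch kernel then coincides with the spectrum kernel.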
Support Vector Machine [Figure: separating hyperplane w · x + b = 0, with margin w · x + b ≥ +1 for Positive Instances (y = +1) and w · x + b ≤ -1 for Negative Instances (y = -1)] 24
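The two margin conditions on the slide combine into a single constraint per training example. A textbook hard-margin formulation in the slide's notation is sketched below; the talk does not state whether a hard or soft margin was used, so this is an assumption. The index i, the count n, the dual coefficients α_i, and the kernel K are introduced here for illustration.

```latex
% Hard-margin SVM: maximize the margin subject to the slide's two constraints,
% written jointly as y_i (w . x_i + b) >= 1.
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1,\dots,n

% Decision rule, and its kernelized form with a string kernel K:
f(x) = \operatorname{sign}(w \cdot x + b)
     = \operatorname{sign}\Big(\sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b\Big)
```

The kernelized form is what lets a string kernel stand in for the dot product, so the SVM never needs the explicit k-mer feature vectors.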
Transfer Learning (KMM) [Figure: true densities p_tr(x) and p_te(x), and ratios r(x)] 26
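The slide's symbols can be tied together with the standard Kernel Mean Matching formulation from the covariate-shift literature. This is given as a sketch, on the assumption that the talk follows that formulation: the weights β_i approximate the density ratio at each training point by matching the means of the training and test samples in feature space. The feature map Φ, the upper bound B, the tolerance ε, and the sample sizes n_tr and n_te are notation introduced here.

```latex
% Importance ratio between test and training densities (standard definition, assumed here).
r(x) = \frac{p_{te}(x)}{p_{tr}(x)}, \qquad \beta_i \approx r\big(x_i^{tr}\big)

% KMM: choose weights so the re-weighted training mean matches the test mean.
\min_{\beta}\ \Big\lVert \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \beta_i\, \Phi\big(x_i^{tr}\big)
  - \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \Phi\big(x_j^{te}\big) \Big\rVert^{2}
\quad \text{s.t.} \quad 0 \le \beta_i \le B, \qquad
\Big\lvert \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \beta_i - 1 \Big\rvert \le \epsilon
```

Expanding the squared norm leaves only kernel evaluations between pairs of sequences, so the problem is a quadratic program that can be solved with a string kernel and without estimating either density.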
Importance Re-weighting [Figure: two panels, "Original Weights" and "KMM Weights"] 28
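One common way to use importance weights in an SVM is to scale each training example's hinge loss by its weight. The Python sketch below evaluates such a weighted objective for a linear model. The function name, the penalty C, and the toy data are hypothetical, and the talk does not specify the exact objective it uses.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_svm_objective(w, b, X, y, beta, C=1.0):
    """0.5 * ||w||^2 + C * sum_i beta_i * hinge_i (importance-weighted hinge loss)."""
    regularizer = 0.5 * dot(w, w)
    hinge = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return regularizer + C * sum(bi * hi for bi, hi in zip(beta, hinge))


# Toy data (hypothetical): only the second point violates the margin.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
w, b = [1.0, 0.0], 0.0

uniform = weighted_svm_objective(w, b, X, y, beta=[1.0, 1.0, 1.0])
reweighted = weighted_svm_objective(w, b, X, y, beta=[1.0, 3.0, 1.0])
print(uniform, reweighted)
```

With uniform weights this reduces to the ordinary soft-margin objective; raising a point's weight makes its margin violations cost more, which pulls the classifier toward source examples that resemble the target context.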
Transfer String Kernel (TSK) [Figure: Source Context and Target Context sequences (examples shown: ATCGATCG, CTCGCTCC, CCCGATCG) each go through Feature Conversion with the Mismatch String Kernel; Knowledge Transfer uses Kernel Mean Matching (KMM); Importance Re-weighting feeds Training of the SVM, followed by Classification] 30
Experimental Setup • 14 Transcription Factors (ENCODE ChIP-seq) • Top 1000 positive sequences (500 training and 500 testing) • 1000 random negative sequences • Hyper-parameter tuning for k=(8,10,12) and m=(1,2,3) • Dictionary size = 4 {A,T,C,G} 32
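The tuning grid from the setup above is small enough to enumerate directly. The Python sketch below lists the nine (k, m) combinations and the size of the k-mer feature space for each k over the 4-letter dictionary. How each combination was scored, for example by cross-validation, is not stated on the slide and is left out.

```python
from itertools import product

K_VALUES = (8, 10, 12)   # k-mer lengths from the slide
M_VALUES = (1, 2, 3)     # allowed mismatches from the slide
ALPHABET = "ATCG"        # dictionary size = 4

# Every (k, m) pair that would be tried during hyper-parameter tuning.
grid = list(product(K_VALUES, M_VALUES))

# Number of possible k-mers, i.e. the dimension of the feature space, per k.
feature_dims = {k: len(ALPHABET) ** k for k in K_VALUES}

for k, m in grid:
    print(f"k={k:2d} m={m} feature space = {feature_dims[k]:,} k-mers")
```

Even at k = 8 the feature space has 65,536 dimensions, which is why these kernels are computed from the k-mers that actually occur instead of from explicit dense vectors.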
Results 34
Results – Cross Context [Figure: AUC score (axis range 0.80 to 0.90) of TSK vs. SK for the transcription factors Sin3a, Max, Mxi1, Chd2, Ctcf] 35
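For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive sequence gets a higher classifier score than a randomly chosen negative one. A minimal Python sketch of that pairwise definition follows. The labels use the +1/-1 convention from the discriminative-approach slide, and the toy scores are hypothetical, not results from the talk.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n_pos * n_neg), which is
    fine for illustration but slower than a rank-based implementation.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == -1]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = auc_score([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the axis range of 0.80 to 0.90 on the slide indicates a classifier well above chance.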
Results – Cross context 37
Summary • Overall, TSK improves cross-context TFBS predictions. • String-kernel-based approaches perform better than state-of-the-art Position Weight/Frequency Matrix based TFBS tools. • The TSK approach can be generalized to improve performance on any cross-context sequence prediction task. Presented at BIOKDD ’15 38
Acknowledgements • Dr. Mazhar Adli, Adli Lab : Department of Biochemistry and Molecular Genetics @ UVa • Nipun Batra, IIIT-Delhi 39
Machine Learning Lab @ UVa • Dr. Yanjun Qi (Advisor) • Beilun Wang • Weilin Xu • Jack Lanchantin • Ji Gao 40
Future Directions • Deep Learning : – Gene expression prediction using histone modification data (ECCB 2016) – Improving TFBS prediction using DNA sequences (ICLR Workshop 2016, ICML Workshop 2016) • String Kernels: Improving efficiency!! (on-going work) 41
Thank You 42