SLIDE 1


Data Mining in Bioinformatics Day 6: Classification in Bioinformatics

Karsten Borgwardt, March 1 to March 12, 2010. Machine Learning & Computational Biology Research Group, MPIs Tübingen.

SLIDE 2

Protein function prediction via graph kernels

Karsten M. Borgwardt, ISMB 2005. Joint work with Cheng Soon Ong, S.V.N. Vishwanathan, Stefan Schönauer, Hans-Peter Kriegel and Alex Smola. Ludwig-Maximilians-Universität Munich, Germany, and National ICT Australia, Canberra.

SLIDE 3

Content

Introduction

  • The problem: protein function prediction
  • The method: Support Vector Machines (SVM)

Our approach to function prediction

  • Protein graph model
  • Protein graph kernel
  • Experimental evaluation

Technique to analyze our graph model

  • Hyperkernels

Discussion

SLIDE 4

Current approaches to protein function prediction

[Diagram: similar sequences, similar structures, similar motifs, similar chemical properties, similar surface clefts, similar phylogenetic profiles, similar interaction partners ⇒ similar function]

SLIDE 5

Current approaches to protein function prediction

[Diagram, continued: similar sequences, similar structures, similar motifs, similar chemical properties, similar surface clefts, similar phylogenetic profiles, similar interaction partners ⇒ similar function]

SLIDE 6

Are new data points (x) red or black? The blue decision boundary allows us to predict the class membership of new data points.

Support Vector Machines

SLIDE 7

Kernel trick

The kernel trick allows us to introduce a separating hyperplane in feature space: a mapping Φ takes data from input space to feature space, where the kernel function computes inner products without evaluating Φ explicitly.
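A minimal sketch (not part of the original slides) of the kernel trick in practice, using scikit-learn: an SVM with an RBF kernel separates classes that are not linearly separable in input space. Dataset and parameters are illustrative only.

    # Minimal sketch: linear vs. kernelized SVM on data that is not linearly separable.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X_train, y_train)      # hyperplane in input space
    rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)  # hyperplane in feature space

    print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
    print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))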

SLIDE 8

Features derived from protein structure and/or protein sequence, e.g. Cai et al. (2004), Dobson and Doig (2003):

  • hydrophobicity
  • polarity
  • polarizability
  • van der Waals volume
  • fraction of amino acid types
  • fraction of surface area
  • disulphide bonds
  • size of largest surface pocket

Feature vectors for function prediction

SLIDE 9

Our approach

Sequence + Structure + Chemical properties → Graph model

SVMs + Graph models → Protein function

SLIDE 10

[Figure: a protein and its secondary structure, modelled as a graph with sequence and structure edges]

Protein graph model

SLIDE 11

Node attributes

  • hydrophobicity
  • polarity
  • polarizability
  • van der Waals volume
  • length
  • helix, sheet, loop

Protein graph model

Edge attributes

  • type (sequence, structure)
  • length
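A minimal sketch (not from the slides) of how such an attributed protein graph could be represented, for instance with networkx. Attribute names and values are illustrative only.

    # Minimal sketch: secondary structure elements (SSEs) as attributed nodes,
    # connected by sequence and structure edges. Values are made up.
    import networkx as nx

    g = nx.Graph()
    # Nodes: SSEs with chemical and geometric attributes
    g.add_node(0, sse="H", hydrophobicity=0.31, polarity=0.12,
               polarizability=0.08, vdw_volume=0.45, length=10)
    g.add_node(1, sse="S", hydrophobicity=0.22, polarity=0.30,
               polarizability=0.11, vdw_volume=0.38, length=1)
    g.add_node(2, sse="S", hydrophobicity=0.27, polarity=0.25,
               polarizability=0.10, vdw_volume=0.40, length=3)
    # Edges: consecutive SSEs in the sequence, and SSEs that are close in 3D space
    g.add_edge(0, 1, type="sequence", length=1.0)
    g.add_edge(1, 2, type="sequence", length=3.0)
    g.add_edge(0, 2, type="structure", length=5.2)

    print(g.nodes(data=True))
    print(g.edges(data=True))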

SLIDE 12

Protein graph kernel

(Kashima et al. (2003) and Gärtner et al. (2003))

compares walks of identical length l

Walks are similar, if along both walks

  • types of secondary structure elements (SSEs) are the same
  • distances between SSEs are similar
  • chemical properties of SSEs are similar

k walk v1,... ,vl,w1,... ,wl=∑

i =1 l −1

kstepvi ,vi1 ,wi ,wi1

SLIDE 13

Example: Protein kernel

Protein A Protein B

Similar (H,10,S,1,S,3,H) (H,9,S,1,S,3,H)

SLIDE 14

Example: Protein kernel

Dissimilar (H,10,S,1,S) (S,3,H,5,S)

Protein A Protein B

SLIDE 15

Kernel type                      Accuracy (%)   SD
Vector kernel                    76.86          1.23
Optimized vector kernel          80.17          1.24
Graph kernel                     77.30          1.20
Graph kernel without structure   72.33          5.32
Graph kernel with global info    84.04          3.33
DALI classifier                  75.07          4.58

10-fold cross-validation on 1128 proteins from the dataset by Dobson and Doig (2003); 59% are enzymes.

Evaluation: enzymes vs. non-enzymes
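A minimal sketch (not the original evaluation code) of 10-fold cross-validation of an SVM on a precomputed kernel matrix, as in the enzyme vs. non-enzyme experiment. The kernel matrix K and labels y below are random placeholders, not the protein data.

    # Minimal sketch: 10-fold CV with a precomputed (graph-)kernel matrix in scikit-learn.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))            # stand-in for some underlying representation
    K = X @ X.T                              # stand-in for a precomputed graph-kernel matrix
    y = (X[:, 0] > 0).astype(int)            # stand-in for enzyme / non-enzyme labels

    svm = SVC(kernel="precomputed", C=1.0)
    scores = cross_val_score(svm, K, y, cv=10)
    print("accuracy: %.2f +/- %.2f" % (scores.mean() * 100, scores.std() * 100))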

SLIDE 16

Which structural or chemical attribute is most important for correct classification? For this purpose, we employ hyperkernels (Ong et al., 2003). Hyperkernels find an optimal linear combination of input kernel matrices,

K = \sum_{i=1}^{m} \beta_i K_i ,

minimizing training error and fulfilling regularization constraints.

Attribute selection

SLIDE 17

Our approach:

  • Calculate kernel matrix for 600 proteins on graph model with only ONE single attribute!
  • Repeat this for all attributes
  • Normalize these kernel matrices
  • Determine hyperkernel combination
  • Weights then reflect contribution of individual attributes to correct classification (a simplified sketch follows below)

Attribute selection
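A minimal sketch of the normalize-and-combine step. This is NOT the hyperkernel optimization of Ong et al. (2003), which solves a semidefinite program; as a simple stand-in for the learned weights, kernel-target alignment scores are used here.

    # Minimal sketch: normalize per-attribute kernel matrices and combine them,
    # K = sum_i beta_i * K_i, with alignment-based weights as a stand-in.
    import numpy as np

    def normalize_kernel(K):
        """Cosine normalization: K_ij / sqrt(K_ii * K_jj)."""
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)

    def alignment(K, y):
        """Kernel-target alignment between K and the label kernel y y^T."""
        Y = np.outer(y, y)
        return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

    def combine(kernels, y):
        kernels = [normalize_kernel(K) for K in kernels]
        beta = np.clip([alignment(K, y) for K in kernels], 0, None)
        beta = beta / beta.sum()     # weights reflect each attribute's contribution
        return sum(b * K for b, K in zip(beta, kernels)), beta

    # Illustrative inputs: three random PSD matrices and binary labels
    rng = np.random.default_rng(0)
    y = rng.choice([-1, 1], size=30)
    kernels = [(lambda X: X @ X.T)(rng.normal(size=(30, 4))) for _ in range(3)]
    K_combined, beta = combine(kernels, y)
    print("weights:", np.round(beta, 3))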

SLIDE 18

Attribute selection

Attribute               EC 1   EC 2   EC 3   EC 4   EC 5   EC 6
Total Polarizability    0.00   0.01   0.00   0.00   0.13   0.00
Total Polarity          0.00   0.14   0.00   0.00   0.01   0.00
Total Hydrophobicity    0.00   0.13   0.00   0.00   0.01   0.00
Total van der Waals     0.00   0.00   0.00   0.00   0.00   0.00
3D length               0.00   0.40   0.00   0.00   0.00   0.00
3-bin Polarizability    0.00   0.00   0.00   0.00   0.12   0.00
3-bin Polarity          0.00   0.01   0.00   0.00   0.00   1.00
3-bin Hydrophobicity    0.00   0.00   0.00   0.00   0.00   0.00
3-bin van der Waals     0.00   0.00   0.00   0.00   0.00   0.00
Amino acid length       1.00   0.31   1.00   1.00   0.73   0.00

SLIDE 19

Discussion

  • Novel combined approach to protein function prediction integrating sequence, structure and chemical information
  • Reaches state-of-the-art classification accuracy on less information; higher accuracy levels on same amount of information
  • Hyperkernels for finding most interesting protein characteristics

SLIDE 20

Discussion

  • More detailed graph models (amino acids, atoms) might be more interesting, yet raise computational difficulties (graphs too large!)

Two directions of future research:

  • Efficient, yet expressive graph kernels for structure
  • Integrating more proteomic information, e.g. surface pockets, into our graph model

SLIDE 21

The End

Thank you! Questions?

SLIDE 22

ARTS: Accurate Recognition of Transcription Starts in human

Sören Sonnenburg†, Alexander Zien∗,♮, Gunnar Rätsch♮

† Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
♮ Friedrich Miescher Laboratory of the Max Planck Society, ∗ Max Planck Institute for Biological Cybernetics, Spemannstr. 37-39, 72076 Tübingen, Germany

Soeren.Sonnenburg@first.fraunhofer.de, {Alexander.Zien,Gunnar.Raetsch}@tuebingen.mpg.de

SLIDE 23

Promoter Detection

Overview:

  • Transcription Start Site (TSS)
  • Features to describe the TSS
  • Our approach
  • Evaluation with current methods
  • Example - Protocadherin-α
  • Summary

SLIDE 24

Promoter Detection

Transcription Start Site - Properties

  • POL II binds to a rather vague region of ≈ [−20, +20] bp
  • Upstream of TSS: promoter containing transcription factor binding sites
  • Downstream of TSS: 5’ UTR, and further downstream coding regions and introns (different statistics)

  • 3D structure of the promoter must allow the transcription factors to bind

⇒ Promoter Prediction is non-trivial

SLIDE 25

Promoter Detection

Features to describe the TSS

  • TFBS in Promoter region
  • condition: DNA should not be too twisted
  • CpG islands (often over TSS/first exon; in most, but not all promoters)
  • TSS with TATA box (≈ −30 bp upstream)
  • Exon content in 5’ UTR region
  • Distance to first donor splice site

Idea: Combine weak features to build strong promoter predictor

SLIDE 26

Promoter Detection

The ARTS Approach

  • use SVM classifier

f(x) = \mathrm{sign}\Big( \sum_{i=1}^{N_s} y_i \alpha_i k(x, x_i) + b \Big)

  • key ingredient is kernel k(x, x′) — similarity of two sequences
  • use 5 sub-kernels suited to model the aforementioned features

k(x, x′) = k_{TSS}(x, x′) + k_{CpG}(x, x′) + k_{coding}(x, x′) + k_{energy}(x, x′) + k_{twist}(x, x′)
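A minimal sketch (not the ARTS implementation): the combined kernel is simply the sum of the five sub-kernels. The dummy sub-kernels below are placeholders; the real ones (Weighted Degree Shift, Spectrum, Linear) are described on the following slides.

    # Minimal sketch: additive combination of sub-kernels on the same pair of sequences.
    def k_arts(x, xp, sub_kernels):
        """sub_kernels: iterable of functions k_i(x, xp) -> float."""
        return sum(k(x, xp) for k in sub_kernels)

    # Illustrative usage with dummy sub-kernels
    dummy_kernels = [lambda x, xp: float(len(set(x) & set(xp)))] * 5
    print(k_arts("ACGT", "ACGG", dummy_kernels))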

SLIDE 27

Promoter Detection

The 5 sub-kernels

  1. TSS signal (including parts of core promoter with TATA box)
     – use Weighted Degree Shift kernel
  2. CpG islands, distant enhancers and TFBS upstream of TSS
     – use Spectrum kernel (large window upstream of TSS)
  3. Model coding sequence and TFBS downstream of TSS
     – use another Spectrum kernel (small window downstream of TSS; a Spectrum kernel sketch follows below)
  4. Stacking energy of DNA
     – use stacking energy of dinucleotides with Linear kernel
  5. Twistedness of DNA
     – use twist angle of dinucleotides with Linear kernel
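A minimal sketch (not the ARTS implementation) of a Spectrum kernel: count all substrings (k-mers) of a fixed length in each sequence and take the dot product of the two count vectors. The window extraction and the choice of k are illustrative.

    # Minimal sketch: Spectrum kernel as a dot product of k-mer count vectors.
    from collections import Counter

    def spectrum_features(seq, k=4):
        """Count all k-mers occurring in the sequence."""
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def spectrum_kernel(x, y, k=4):
        """Dot product of the k-mer count vectors of x and y."""
        cx, cy = spectrum_features(x, k), spectrum_features(y, k)
        return sum(cx[kmer] * cy[kmer] for kmer in cx.keys() & cy.keys())

    print(spectrum_kernel("ACGTACGTGGCG", "ACGTTTGGCGCG", k=3))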

SLIDE 28

Promoter Detection

Weighted Degree Shift Kernel

[Figure: two sequences x1, x2 with matching blocks, each contributing a weight, e.g. k(x1, x2) = w_{6,3} + w_{6,−3} + w_{3,4}]

  • Count matching substrings of length 1 . . . d
  • Weight according to length of the match β1 . . . βd
  • Position dependent but tolerates “shifts” of up to S

k(x, x') = \sum_{k=1}^{d} \beta_k \sum_{l=1}^{L-k+1} \sum_{s=0,\; s+l \le L}^{S} \delta_s \big( I(x[k : l+s] = x'[k : l]) + I(x[k : l] = x'[k : l+s]) \big)

where x[k : l] := subsequence of x of length k starting at position l
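A minimal sketch of the Weighted Degree Shift kernel as defined above (not the ARTS/SHOGUN implementation). The weights β_k and shift penalties δ_s used here are common choices assumed for illustration, not necessarily those used in ARTS.

    # Minimal sketch: count matching substrings of length 1..d, weighted by length,
    # tolerating shifts of up to S positions. Bounds are checked to stay in range.
    def wd_shift_kernel(x, y, d=6, S=3):
        L = min(len(x), len(y))
        beta = [2.0 * (d - k + 1) / (d * (d + 1)) for k in range(1, d + 1)]   # assumed weights
        delta = [1.0 / (2.0 * (s + 1)) for s in range(S + 1)]                 # assumed shift penalties
        total = 0.0
        for k in range(1, d + 1):                 # substring lengths 1 .. d
            for l in range(L - k + 1):            # start positions (0-based)
                for s in range(S + 1):            # shifts 0 .. S
                    if l + s + k > L:
                        break
                    matches = (x[l + s:l + s + k] == y[l:l + k]) + \
                              (x[l:l + k] == y[l + s:l + s + k])
                    total += beta[k - 1] * delta[s] * matches
        return total

    print(wd_shift_kernel("ACGTACGTAC", "ACGTTCGTAC"))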

SLIDE 29

Promoter Detection

Training – Data Generation

True TSS:

  • From dbTSSv4 (based on hg16) extract putative TSS windows of size [−1000, +1000]

Decoy TSS:

  • Annotate dbTSSv4 with transcription-stop (via BLAT alignment of mRNAs)
  • From the interior of the gene (+100bp to gene end) sample negatives for training (10 per positive), again windows [−1000, +1000]

Processing:

  • 8508 positive, 85042 negative examples
  • Split into disjoint training and validation set (50% : 50%)

SLIDE 30

Promoter Detection

Training – Model Selection

16 kernel parameters + SVM regularization to be tuned!

  • Full grid search infeasible
  • Local axis-parallel searches instead

SVM training/evaluation on > 10,000 examples computationally too demanding

Speedup trick:

f(x) = \sum_{i=1}^{N_s} \alpha_i k(x_i, x) + b = \underbrace{\Big(\sum_{i=1}^{N_s} \alpha_i \Phi(x_i)\Big)}_{w} \cdot \Phi(x) + b = w \cdot \Phi(x) + b

Evaluating f(x): before O(N_s d L S), now O(d L) ⇒ speedup factor up to N_s · S

⇒ Large Scale Training and Evaluation possible
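A minimal sketch of the speedup trick for a Spectrum kernel (not the SHOGUN "linadd" implementation used by ARTS): because Φ(x) is an explicit vector of k-mer counts, w = Σ_i α_i Φ(x_i) can be precomputed once, and prediction becomes a single pass over the k-mers of x. Sequences and coefficients below are made up.

    # Minimal sketch: collapse support vectors into one explicit weight vector w.
    from collections import Counter, defaultdict

    def kmer_counts(seq, k=4):
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def precompute_w(support_seqs, alphas, k=4):
        """w = sum_i alpha_i * Phi(x_i), stored sparsely as a k-mer -> weight map."""
        w = defaultdict(float)
        for seq, a in zip(support_seqs, alphas):
            for kmer, c in kmer_counts(seq, k).items():
                w[kmer] += a * c
        return w

    def decision_function(x, w, b, k=4):
        """f(x) = w . Phi(x) + b, computed without touching the support vectors."""
        return sum(w[kmer] * c for kmer, c in kmer_counts(x, k).items()) + b

    # Illustrative usage with made-up support sequences and coefficients
    w = precompute_w(["ACGTACGTGG", "TTGACGTACC"], alphas=[0.7, -0.4], k=3)
    print(decision_function("ACGTACGTAC", w, b=0.1, k=3))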

SLIDE 31

Promoter Detection

Comparison

Current state-of-the-art methods:

  • FirstEF [Davuluri, Grosse, Zhang; 2001, Nat Genet]
    QDF for promoter, donor, first exon; WM; Range: [−1500, +500]
  • McPromoter [Ohler, Liao, Niemann, Rubin; 2002, Genome Biol]
    GHMM with IMC for 6 regions (e.g. upstream, TATA); NN; Range: [−250, +50]
  • Eponine [Down, Hubbard; 2002, Genome Res]
    RVM: WM with positional distribution for 4 regions (e.g. TATA, CpG); Range: [−200, +200]

⇒ Do a genome-wide evaluation! ⇒ How to do a fair comparison?

SLIDE 32

Promoter Detection

Evaluation

Idea: Only consider “new” TSS from dbTSSv5 − dbTSSv4, with max 30% overlap

  1. Compute genome-wide outputs for each TSS finder
  2. Decrease resolution: divide the genome into non-overlapping fixed-size chunks (e.g. 50 or 500 bp)
  3. Annotate dbTSSv5 TSS with gene end
  4. Label a chunk positive if it intersects [TSS − 20 bp, TSS + 20 bp]
  5. Label a chunk negative if it lies within [TSS + 21 bp, gene end] (a labeling sketch follows below)
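A minimal sketch of the chunk labeling in steps 4-5 above (not the original evaluation code); chunk size, TSS position and gene end are illustrative.

    # Minimal sketch: label fixed-size genome chunks around one annotated TSS.
    def label_chunks(tss, gene_end, chunk_size=500):
        """Return {chunk_index: +1/-1} for one gene; unlabeled chunks are omitted."""
        labels = {}
        # negative region: interior of the gene, downstream of the TSS window
        for pos in range(tss + 21, gene_end + 1, chunk_size):
            labels[pos // chunk_size] = -1
        # positive chunks: those intersecting [TSS - 20, TSS + 20] (take precedence)
        for chunk in range((tss - 20) // chunk_size, (tss + 20) // chunk_size + 1):
            labels[chunk] = +1
        return labels

    print(label_chunks(tss=10_000, gene_end=15_000, chunk_size=500))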

SLIDE 33

Promoter Detection

Results

Receiver Operating Characteristic curve and Precision-Recall curve

⇒ 35% true positives at a false positive rate of 1/1000 (the best other method finds about half as many, 18%)

SLIDE 34

Promoter Detection

What does ARTS do better?

Entropy and Relative Entropy

[Figure: entropy and relative entropy around the TSS; auROC: 86.5%, auPRC: 49.8%]

Di-nucleotide Frequency

⇒ strong discriminative signal around TSS

SLIDE 35

Promoter Detection

Which kernel captures most information?

[Figure: area under ROC curve (in %) when using or removing single kernels — TSS WD shift, Promoter Spectrum, 1st Exon Spectrum, Angles (Linear); range ≈ 80–96%]

⇒ Most important: the Weighted Degree Shift kernel modelling the TSS signal

SLIDE 36

Promoter Detection

Alternative TSS - Protocadherin-α

SLIDE 37

Promoter Detection

Conclusion

  • Developed a new TSS finder, “ARTS”
  • In a genome-wide evaluation it achieves state-of-the-art results: about 35% true positives at a false positive rate of 1/1000 (the best other method about half, 18%)
  • Reason: intensive modelling of the TSS region, large-scale SVM training/evaluation with string kernels
  • Future work: Drosophila, C. elegans, Zebrafish, ...

Poster: H56
Datasets, genome browser custom track, and a lot more details: http://www.fml.tuebingen.mpg.de/raetsch/projects/arts
Source code of the SHOGUN toolbox used to train ARTS is freely available: http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun

SLIDE 38

The end


See you tomorrow! Next topic: Clustering in Bioinformatics