Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics Day 6: Classification in Bioinformatics
Karsten Borgwardt March 1 to March 12, 2010 Machine Learning & Computational Biology Research Group MPIs Tübingen
Karsten Borgwardt et al. - Protein function prediction via graph kernels
Node attributes
Edge attributes
(Kashima et al. (2003) and Gärtner et al. (2003))
$$k_{walk}\big((v_1, \ldots, v_l), (w_1, \ldots, w_l)\big) = \sum_{i=1}^{l-1} k_{step}\big((v_i, v_{i+1}), (w_i, w_{i+1})\big)$$
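The walk kernel above adds up a step kernel over consecutive pairs of nodes in two walks of equal length. A minimal Python sketch of that sum follows. The step kernel used here (1 if both endpoint labels of the two steps agree, 0 otherwise) and the node labels are illustrative assumptions, not the step kernel used in the talk.

```python
def k_step(step_v, step_w):
    """Hypothetical step kernel: 1.0 if both endpoint labels agree, else 0.0."""
    return 1.0 if step_v == step_w else 0.0


def k_walk(walk_v, walk_w, step_kernel=k_step):
    """Sum the step kernel over the l-1 consecutive steps of two walks of length l."""
    if len(walk_v) != len(walk_w):
        raise ValueError("walks must have the same length")
    steps_v = zip(walk_v, walk_v[1:])  # (v_i, v_{i+1}) for i = 1 .. l-1
    steps_w = zip(walk_w, walk_w[1:])  # (w_i, w_{i+1}) for i = 1 .. l-1
    return sum(step_kernel(sv, sw) for sv, sw in zip(steps_v, steps_w))


# Assumed node labels (secondary structure element types) along two walks.
walk_a = ["helix", "sheet", "helix", "helix"]
walk_b = ["helix", "sheet", "sheet", "helix"]
similarity = k_walk(walk_a, walk_b)  # only the first step matches
```

Any other positive definite step kernel, for example one that also compares node and edge attributes, can be passed in as `step_kernel`.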
Kernel type                      Accuracy (%)   SD
Vector kernel                    76.86          1.23
Optimized vector kernel          80.17          1.24
Graph kernel                     77.30          1.20
Graph kernel without structure   72.33          5.32
Graph kernel with global info    84.04          3.33
DALI classifier                  75.07          4.58
Attribute              EC 1   EC 2   EC 3   EC 4   EC 5   EC 6
Amino acid length      1.00   0.31   1.00   1.00   0.73   0.00
3-bin van der Waals    0.00   0.00   0.00   0.00   0.00   0.00
3-bin Hydrophobicity   0.00   0.00   0.00   0.00   0.00   0.00
3-bin Polarity         0.00   0.01   0.00   0.00   0.00   1.00
3-bin Polarizability   0.00   0.00   0.00   0.00   0.12   0.00
3d length              0.00   0.40   0.00   0.00   0.00   0.00
Total van der Waals    0.00   0.00   0.00   0.00   0.00   0.00
Total Hydrophobicity   0.00   0.13   0.00   0.00   0.01   0.00
Total Polarity         0.00   0.14   0.00   0.00   0.01   0.00
Total Polarizability   0.00   0.01   0.00   0.00   0.13   0.00
Sören Sonnenburg†, Alexander Zien∗,♮, Gunnar Rätsch♮
† Fraunhofer FIRST.IDA, Kekulé...
♮ Friedrich Miescher Laboratory of the Max Planck Society, ∗ Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Soeren.Sonnenburg@first.fraunhofer.de, {Alexander.Zien,Gunnar.Raetsch}@tuebingen.mpg.de
Promoter Detection
Sonnenburg, Zien, Rätsch
and introns (different statistics)
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N_s} y_i \alpha_i k(x, x_i) + b \right)$$
$$k(x, x') = k_{TSS}(x, x') + k_{CpG}(x, x') + k_{coding}(x, x') + k_{energy}(x, x') + k_{twist}(x, x')$$
– use Weighted Degree Shift kernel
– use Spectrum kernel (large window upstream of TSS)
– use another Spectrum kernel (small window downstream of TSS)
– use btwist energy of dinucleotides with Linear kernel
– use btwist angle of dinucleotides with Linear kernel
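Two of the sub-kernels listed above are Spectrum kernels, which compare sequence windows by their k-mer content. As a rough illustration, the sketch below computes a Spectrum kernel as the inner product of k-mer count vectors. The function names, the k-mer length, and the toy sequences are assumptions made for this example, and the code is not taken from the SHOGUN toolbox.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    counts_x = kmer_counts(x, k)
    counts_y = kmer_counts(y, k)
    # k-mers missing from y contribute 0 (Counter returns 0 for unknown keys).
    return sum(count * counts_y[kmer] for kmer, count in counts_x.items())


# Toy windows; real inputs would be the windows up- and downstream of the TSS.
value = spectrum_kernel("ACGT", "ACGT", 2)  # shared 2-mers: AC, CG, GT
```

Because only k-mer counts enter the kernel, the positions of the k-mers inside the window are ignored, in contrast to the position-dependent Weighted Degree Shift kernel used around the TSS.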
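$$k(x, x') = \sum_{k=1}^{d} \beta_k \sum_{l=1}^{L-k+1} \sum_{\substack{s=0 \\ s+l \le L}}^{S} \delta_s \Big( I\big(x[k : l+s] = x'[k : l]\big) + I\big(x[k : l] = x'[k : l+s]\big) \Big)$$
x[k : l] := subsequence of x of length k starting at position l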
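The Weighted Degree Shift kernel above can be evaluated directly with three nested loops over the k-mer length k, the start position l, and the shift s. The following sketch does exactly that, using 0-based positions. The weights `beta` and `delta` are passed in by the caller, and the values in the example are arbitrary assumptions rather than the weights used for ARTS. Shifts that would push a k-mer past the end of the sequence are skipped, which is one way to read the constraint on s and l.

```python
def wd_shift_kernel(x, y, d, S, beta, delta):
    """Naive Weighted Degree Shift kernel for two equal-length sequences.

    beta[k-1] weights k-mers of length k (k = 1 .. d);
    delta[s] weights matches found at shift s (s = 0 .. S).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    L = len(x)
    total = 0.0
    for k in range(1, d + 1):
        for l in range(L - k + 1):          # 0-based start of the k-mer
            for s in range(S + 1):
                if l + s + k > L:           # shifted k-mer would leave the sequence
                    break
                matches = (x[l + s:l + s + k] == y[l:l + k]) \
                    + (x[l:l + k] == y[l + s:l + s + k])
                total += beta[k - 1] * delta[s] * matches
    return total


# Assumed toy weights: single nucleotides only, shifts of 0 or 1.
shifted = wd_shift_kernel("ACGT", "CGTA", d=1, S=1, beta=[1.0], delta=[0.5, 0.25])
```

In the toy call the two sequences never agree at the same position, so the whole kernel value comes from the three matches found at shift 1. This direct evaluation costs on the order of d·L·S operations per pair of sequences.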
[−1000, +1000]
mRNAs)
training (10 per positive), again windows [−1000, +1000]
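SVM training/evaluation on > 10,000 examples computationally too demanding
$$f(x) = \sum_{i=1}^{N_s} \alpha_i k(x_i, x) + b = \left( \sum_{i=1}^{N_s} \alpha_i \Phi(x_i) \right) \cdot \Phi(x) + b = w \cdot \Phi(x) + b$$
Cost of evaluating f(x): before O(N_s d L S), now O(dL) ⇒ speedup factor up to N_s · S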
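The speedup above comes from folding all support vectors into a single weight vector w once, so that each prediction needs only one sparse dot product instead of N_s kernel evaluations. The sketch below checks this identity on toy data. To stay short it uses a simple 2-mer count feature map for Φ instead of the Weighted Degree Shift feature map, and the support vectors, coefficients, and offset are made-up values.

```python
from collections import Counter


def phi(seq, k=2):
    """Assumed explicit feature map: sparse k-mer counts of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dot(u, v):
    """Sparse dot product of two feature dictionaries."""
    return sum(val * v.get(key, 0.0) for key, val in u.items())


def f_kernel(x, support_vectors, alphas, b):
    """Kernel expansion: one kernel evaluation per support vector."""
    return sum(a * dot(phi(sv), phi(x))
               for sv, a in zip(support_vectors, alphas)) + b


def build_w(support_vectors, alphas):
    """Precompute w = sum_i alpha_i * phi(x_i) once, after training."""
    w = {}
    for sv, a in zip(support_vectors, alphas):
        for key, val in phi(sv).items():
            w[key] = w.get(key, 0.0) + a * val
    return w


def f_linear(x, w, b):
    """Fast evaluation: a single dot product with the precomputed w."""
    return dot(phi(x), w) + b


# Made-up support vectors and coefficients for illustration only.
svs, alphas, b = ["ACGT", "CGCG", "TTAC"], [0.5, -1.0, 0.25], 0.1
w = build_w(svs, alphas)
slow, fast = f_kernel("ACGC", svs, alphas, b), f_linear("ACGC", w, b)
```

Both evaluations agree up to floating-point rounding, while the cost of `f_linear` no longer depends on the number of support vectors.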
QDF: for promoter, donor, first exon, WM Range: [−1500, +500]
GHMM with IMC for 6 regions (e.g. upstream, TATA) NN Range: [−250, +50]
RVM: WM with positional distribution for 4 regions (e.g. TATA, CpG) Range: [−200, +200]
(e.g. 50 or 500)
Receiver Operator Characteristic Curve and Precision Recall Curve
Entropy and Relative Entropy
[Figure: entropy and relative entropy panels, each labelled auROC: 86.5%, auPRC: 49.8%; further panel title: Di-nucleotide Frequency]
[Figure: area under ROC curve (in %), axis from 80 to 96, when using or removing single kernels: TSS WD shift, Promoter Spectrum, 1st Exon Spectrum, Angles Linear]
35% true positives at a false positive rate of 1/1000 (the best other method reaches about half of that, 18%)
Intensive modelling of the TSS region; large-scale SVM training/evaluation with string kernels
Poster: H56
Datasets, Genome Browser custom track, and many more details: http://www.fml.tuebingen.mpg.de/raetsch/projects/arts
Source code of the SHOGUN toolbox used to train ARTS is freely available: http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun
See you tomorrow! Next topic: Clustering in Bioinformatics