Detecting Outliers with Ensemble of Profile HMMs
Xilin Yu (UIUC), under the supervision of Tandy Warnow
December 11, 2018
Table of Contents
1 Introduction
2 Method Overview
3 Experiments
    Data Sets
    Methods and Parameters
4 Evaluation Results
5 Summary and Future Work
Introduction
Problem: Given a set of sequences in which most of the sequences are homologous to each other, detect the few (say ≤ 5%) outliers.
- Outlier: a sequence not homologous to the majority of the sequences
- Harm of outliers in a set of sequences: propagation and magnification of error
- Difficulty: homology is defined in terms of evolutionary history
    - No ground truth
    - Human expert curation gives almost ground truth, but it is hard to summarize into a step-by-step algorithm
Introduction
Different approaches for different goals:
- TreeShrink: detects unexpectedly long branches, to decrease gene tree discordance for a better species tree
- OD-seq: distance metric using the gappiness of the alignment, to reduce the level of under-alignment
- EDM (edit distance method): edit distance, to increase the proximity of sequences
Method Overview: Ensemble of Profile HMMs
Key Ideas
- A random sample is unlikely to contain any outlier
- A profile HMM built on the sample generates outliers with low probability
- A profile HMM is more accurate on a closely related subset of sequences: build HMMs on a hierarchy of subsets of the sample
- Multiple independent runs reduce the chance of missing outliers
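The key ideas above can be sketched as a sampling loop. This is only a minimal illustration, not the implementation from the talk. The names `run_trial`, `ensemble_outliers`, and `toy_detect` are hypothetical, the detector is a toy stand-in for the profile HMM step, and taking the union of flagged sequences across trials is an assumption (the slides only say that multiple runs reduce misses).

```python
import random
from typing import Callable, List, Sequence, Set

# Hypothetical detector signature: given sample sequences and query sequences,
# return the positions (within the queries) that the model fails to match.
Detector = Callable[[List[str], List[str]], Set[int]]


def run_trial(seqs: Sequence[str], sample_idx: Sequence[int], detect: Detector) -> Set[int]:
    """Build a model on the sample; return indices (into seqs) of unmatched sequences."""
    in_sample = set(sample_idx)
    rest = [i for i in range(len(seqs)) if i not in in_sample]
    if not in_sample or not rest:
        return set()  # nothing to train on, or nothing left to test
    unmatched = detect([seqs[i] for i in sample_idx], [seqs[i] for i in rest])
    return {rest[j] for j in unmatched}


def ensemble_outliers(seqs: Sequence[str], detect: Detector,
                      sample_prob: float = 0.1, trials: int = 3, seed: int = 0) -> Set[int]:
    """Repeat independent sampling trials and combine the flagged indices."""
    rng = random.Random(seed)
    flagged: Set[int] = set()
    for _ in range(trials):
        # Each sequence enters the sample independently with probability sample_prob.
        sample_idx = [i for i in range(len(seqs)) if rng.random() < sample_prob]
        # Assumption: union across trials (not confirmed by the slides).
        flagged |= run_trial(seqs, sample_idx, detect)
    return flagged


def toy_detect(sample: List[str], queries: List[str]) -> Set[int]:
    """Toy stand-in for a profile HMM: unmatched means no shared letter with the sample."""
    alphabet = set("".join(sample))
    return {j for j, q in enumerate(queries) if not (set(q) & alphabet)}
```

In a real pipeline, `toy_detect` would be replaced by the step that builds profile HMMs on the sample and reports the sequences they fail to match.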
Method Flow Chart
Figure: Flowchart of HIPPI
Experiments
Experiment Design
- Seeds of a Pfam protein family: human curated, taken as ground truth
- Artificial outliers: seeds from other families
- Evaluation: TP, FN and FP
    - Precision: TP/(TP+FP)
    - Recall: TP/(TP+FN)
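The two evaluation formulas can be computed from sets of sequence identifiers, as in the short sketch below. The function name `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the slides do not specify.

```python
def precision_recall(predicted, true_outliers):
    """Return (precision, recall) for predicted vs. true outlier identifiers."""
    predicted, true_outliers = set(predicted), set(true_outliers)
    tp = len(predicted & true_outliers)   # outliers correctly flagged
    fp = len(predicted - true_outliers)   # family members wrongly flagged
    fn = len(true_outliers - predicted)   # outliers that were missed
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Example: 4 sequences flagged, 3 true outliers, 2 of them found.
p, r = precision_recall({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s5"})
```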
Data Sets
- Families with at least 100 sequences (around 3500 families)
- Divide into families of small size (≤ 200) and large size (> 200)
- Divide into families of short, medium, and long sequence length
- For each family A, randomly choose a family B as the source of outliers (average length of B between 50% and 200% of that of A)
- Draw sequences uniformly at random from B such that E[number of outliers] = 5% |A|
- For each pair of main family and outlier family, run 3 independent experiments
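One way to realize the outlier injection described above is sketched below. The function names are hypothetical, and selecting each sequence of B independently with probability 0.05 |A| / |B| is an assumption: it gives the stated expectation, but the slides do not say how the draw was actually done.

```python
import random


def length_compatible(family_a, family_b):
    """True if the average length of B is between 50% and 200% of that of A."""
    avg_a = sum(len(s) for s in family_a) / len(family_a)
    avg_b = sum(len(s) for s in family_b) / len(family_b)
    return 0.5 * avg_a <= avg_b <= 2.0 * avg_a


def draw_outliers(family_a, family_b, rate=0.05, rng=None):
    """Pick sequences from B so that the expected count is rate * |A|."""
    rng = rng or random.Random()
    # Assumption: independent selection per sequence. The probability is capped
    # at 1, so the expectation falls short when B is smaller than rate * |A|.
    p = min(1.0, rate * len(family_a) / len(family_b))
    return [s for s in family_b if rng.random() < p]
```

Because the number of outliers is random under this scheme, a given experiment can contain zero outliers, which is one reason to repeat each family pair several times.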
Methods and Parameters
Ensemble method: sampling probability = 0.1, number of trials = 3 (expected outliers in sample << 1)
- MAFFT: default
- FastTree2: default
- HIPPI: decomp size = {10, 12, 15, 20}, min p-dist = default
- Outliers: sequences unmatched by HIPPI
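The role of the sampling probability and the number of trials can be illustrated with a short calculation. It assumes that each sequence enters the sample independently with the sampling probability, which the slides do not state explicitly, and both function names are hypothetical.

```python
def prob_clean_sample(num_outliers, sample_prob):
    """Probability that a random sample contains none of the outliers."""
    # Assumption: every sequence is sampled independently with sample_prob.
    return (1.0 - sample_prob) ** num_outliers


def prob_some_clean_trial(num_outliers, sample_prob, trials):
    """Probability that at least one of the independent trials is outlier free."""
    p_dirty = 1.0 - prob_clean_sample(num_outliers, sample_prob)
    return 1.0 - p_dirty ** trials


# Example: 5 outliers in the data set, sampling probability 0.1, 3 trials.
single = prob_clean_sample(5, 0.1)
combined = prob_some_clean_trial(5, 0.1, 3)
```

Under these assumptions, repeating the trial raises the chance that at least one profile HMM ensemble is built from an outlier-free sample.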
Methods and Parameters
Edit distance method: sampling probability = 0.05, number of std (ℓ) = {1, 2, 3}
- d(x): average edit distance between the sample and x
- Outliers: sequences with d(x) at least mean + ℓ * std
OD-seq: threshold = {0.001, 0.01, 0.02}
- Outliers: sequences that are above the threshold in the distribution of distance scores
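The edit distance rule can be sketched as follows. This is an illustrative reading rather than the code used in the experiments: it assumes the mean and standard deviation are taken over d(x) for all sequences, uses the population standard deviation, and takes the sample as an explicit argument. The function names are hypothetical.

```python
from statistics import mean, pstdev


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def edit_distance_outliers(seqs, sample, num_std):
    """Indices of sequences whose d(x) is at least mean + num_std * std."""
    d = [mean(edit_distance(x, s) for s in sample) for x in seqs]
    spread = pstdev(d)
    if spread == 0:
        return []  # all sequences are equally far from the sample
    cutoff = mean(d) + num_std * spread
    return [i for i, dx in enumerate(d) if dx >= cutoff]
```

Each call to `edit_distance` is quadratic in sequence length, and every sequence is compared with every sampled sequence, so the cost grows quickly with family size and sequence length.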
Evaluation
Method, parameter    Precision    Recall    Running time
Ensemble, 10         0.871        0.955     ≤ 0.2 s
Ensemble, 12         0.855        0.954     ≤ 0.2 s
Ensemble, 15         0.841        0.948     ≤ 0.2 s
Ensemble, 20         0.832        0.959     ≤ 0.2 s
OD-seq, 0.001        0.430        0.995     ≤ 0.02 s
OD-seq, 0.01         0.386        0.997     ≤ 0.02 s
OD-seq, 0.02         0.371        0.998     ≤ 0.02 s
Table: Averaged results of the ensemble method and OD-seq on the protein families with 100 to 200 seed sequences and average length 100 to 200
Summary
- The ensemble method has much higher precision than OD-seq (0.832 to 0.871 versus 0.371 to 0.430).
- Both have over 90% recall, and OD-seq does slightly better.
- Both the ensemble method and OD-seq are efficient, while the edit distance method is much slower.
- For the ensemble method, the best decomposition size seems to be 10, which gives the highest precision (0.871).
Future work
- Add the edit distance method to the comparison
- Compare to the method of using one HMM
- Run on all groups of data
- Differentiate between in-clan and out-clan outliers
- More evaluation criteria