A method for similarity-based grouping of biological data
Vaida Jakonienė, David Rundqvist, Patrick Lambrix
Outline
• Environments for supporting grouping algorithms needed
• Method for similarity-based grouping
• Test cases
• Summary and future work
V. Jakonienė, D. Rundqvist, P. Lambrix. Linköpings universitet, Sweden
Tools for biological data analysis
Hierarchical microarray clustering (J-Express Pro)
Classification of abstracts
Tools for biological data analysis
• Other applications of grouping
  • structuring search results
  • data cleaning
  • data integration
Similarity of biological data
Similarity between data entries
Sequence alignment (BLAST)
Lord PW, Stevens RD, Brass A, Goble CA. Bioinformatics, 19(10):1275-83, 2003.
• Basic task: computation of a similarity value between objects
Similarity-based grouping
• Similarity-based grouping for biological data is needed
• Not a trivial task
  • influenced by a number of aspects
  • data is complex
  • a variety of grouping algorithms is available: which method performs best for which grouping task?
  • existing grouping algorithms may not be straightforward to apply
Similarity-based grouping
• Environments that support comparison and evaluation of different grouping strategies are needed
Method for similarity-based grouping
[Figure: overview of the method. Components: data source (grouping attributes), library of similarity functions (domain-independent and domain-dependent), other knowledge, library of classifications. Steps: specification of grouping rules, pairwise grouping, grouping, evaluation, analysis.]
Kitega: a toolKit for Evaluating Grouping Algorithms
Test cases
• Grouping task. Grouping of proteins with respect to
  • biological function
  • the class of isozymes they belong to
• Data source
  • human proteins involved in glycolysis
  • 190 data entries retrieved via Entrez
Test cases. Data entry
Entrez Protein database
Test cases. Data entry
Test cases. Data entry
GO_ann, Sequence
Test cases. Data sources and mappings
• Only terms of the GO function ontology analyzed
• Only data entries having GO terms
• GO Consortium mappings between data values and ontological terms:
  • ec2go: ec_numbers translated into GO terms
  • spkw2go: SwissProt keywords translated into GO terms
• Data sources:
  • DS1: GO_ann; 67 data entries
  • DS2: Keywords (via spkw2go), Ec_number (via ec2go), and GO_ann combined into GO_comb; 93 data entries
  • DS3: Ec_number (via ec2go) and GO_ann combined into GO_comb; 92 data entries
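The combination of annotated and mapped GO terms can be sketched as a set union. This is only an illustration under stated assumptions: the entry layout (keys `go_ann`, `ec_numbers`, `keywords`), the function name `combine_go_terms`, and the two toy mapping tables are hypothetical stand-ins, not the actual ec2go or spkw2go files or the Kitega data model.

```python
def combine_go_terms(entry, ec2go, spkw2go=None):
    """Return GO_comb: annotated GO terms plus terms obtained via mappings.

    entry   -- dict with optional keys "go_ann", "ec_numbers", "keywords"
               (hypothetical layout chosen for this sketch)
    ec2go   -- dict: EC number -> set of GO terms
    spkw2go -- dict: keyword -> set of GO terms, or None to skip keywords
    """
    terms = set(entry.get("go_ann", ()))
    for ec in entry.get("ec_numbers", ()):
        terms |= ec2go.get(ec, set())
    if spkw2go is not None:
        for keyword in entry.get("keywords", ()):
            terms |= spkw2go.get(keyword, set())
    return terms


# Toy mapping tables (assumed for illustration only).
TOY_EC2GO = {"2.7.1.11": {"6-phosphofructokinase activity"}}
TOY_SPKW2GO = {"Glycolysis": {"Glycolysis"}}

entry = {"go_ann": ["ATP binding"],
         "ec_numbers": ["2.7.1.11"],
         "keywords": ["Glycolysis"]}

ds3_style = combine_go_terms(entry, TOY_EC2GO)               # ec2go only
ds2_style = combine_go_terms(entry, TOY_EC2GO, TOY_SPKW2GO)  # ec2go and spkw2go
```

Leaving out `spkw2go` corresponds to the DS3-style combination; passing both tables corresponds to the DS2-style combination.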
Test cases. Other components
• Library of similarity functions
  • EditDist(v1, v2)
  • SeqSim(v1, v2)
  • SemSim(v1, v2)
• Other knowledge
  • GO ontology
• Classifications. Manual classification according to
  • biological function
  • classes of isozymes
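The slides name EditDist(v1, v2) without defining it. A minimal sketch of one plausible reading follows, assuming Levenshtein edit distance normalized by the length of the longer string so that the result is a similarity in [0, 1]. The names `edit_distance` and `edit_sim` and the normalization are assumptions of this sketch, not the toolkit's definition.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def edit_sim(v1: str, v2: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(v1), len(v2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(v1, v2) / longest
```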
Method. Specification of grouping rules
[Screenshot: specification of grouping rules (DS3)]
Method. Pairwise grouping
• All pairs of data entries compared (DS3)
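The pairwise step can be sketched as an all-pairs comparison that keeps the pairs whose similarity reaches a threshold. The sketch below is illustrative only: Jaccard overlap of term sets stands in for whichever similarity function a grouping rule selects, and the entry identifiers, term labels, and threshold value are made up.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Stand-in similarity function (an assumption of this sketch)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_grouping(entries: dict, sim, threshold: float) -> set:
    """Compare every pair of entries once; return the pairs judged similar."""
    similar = set()
    for id1, id2 in combinations(sorted(entries), 2):
        if sim(entries[id1], entries[id2]) >= threshold:
            similar.add((id1, id2))
    return similar


# Hypothetical entries: identifier -> set of term labels.
entries = {
    "P1": {"t1", "t2"},
    "P2": {"t1", "t2", "t3"},
    "P3": {"t4"},
}
similar_pairs = pairwise_grouping(entries, jaccard, threshold=0.5)
```

For n entries this performs n(n-1)/2 comparisons, which is unproblematic for data sources of the size used in the test cases.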
Method. Grouping
• ConnectedComponents: data entries in a group are directly or transitively similar to each other
• Cliques: all data entries in a group are similar to each other
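Both strategies can be sketched on a graph whose nodes are data entries and whose edges are the similar pairs. The code below is a sketch under assumptions: it uses a depth-first search for connected components and the basic Bron-Kerbosch procedure (without pivoting) for maximal cliques, and all function names are hypothetical. The slides do not say which algorithms the toolkit uses or how it treats entries that fall into several cliques.

```python
def build_graph(nodes, pairs):
    """Adjacency sets for an undirected similarity graph."""
    adj = {n: set() for n in nodes}
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Groups whose members are directly or transitively similar."""
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(component)
    return groups


def maximal_cliques(adj):
    """Groups in which every member is similar to every other member."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(set(current))   # cannot be extended: maximal
            return
        for v in sorted(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


adj = build_graph(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
components = connected_components(adj)
cliques = maximal_cliques(adj)
```

In this toy graph A and C end up in the same connected component through B, although they are not directly similar, while the clique strategy separates them and places B in two overlapping groups.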
Method. Grouping
Method. Evaluation
• Types of quality measures
  • internal: based on information obtained during the grouping
  • external: with respect to known classes of the grouped data
Method. Analysis
• true positives
• false positives
• false negatives
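One common way to obtain these counts is to compare pairs of entries: a pair placed in the same group counts as a true positive if the known classification also puts it in one class, and as a false positive otherwise; a pair that shares a class but not a group is a false negative. The sketch below assumes this pair-counting scheme and derives precision, recall, and F1 from it. The slides do not state how the toolkit defines the counts, so the scheme and all names here are assumptions.

```python
from itertools import combinations


def same_group_pairs(groups):
    """All unordered pairs of entries that share a group."""
    pairs = set()
    for group in groups:
        pairs.update(combinations(sorted(group), 2))
    return pairs


def pair_counts(groups, classes):
    """(tp, fp, fn) of the grouping against a known classification."""
    predicted = same_group_pairs(groups)
    expected = same_group_pairs(classes)
    tp = len(predicted & expected)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical grouping result and manual classification.
groups = [{"A", "B", "C"}, {"D"}]
classes = [{"A", "B"}, {"C", "D"}]
tp, fp, fn = pair_counts(groups, classes)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```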
Method. Analysis
Method. Analysis
• Studied aspects, e.g. use of different data sources, grouping algorithms, and classifications; grouping on different attributes; impact of threshold
Test cases. Observations
• Best suited grouping approaches for data source Glyc-Funct-AnnEc-onlyGO (DS3)
  • SemSim(GO_comb) for grouping on biological function
  • SeqSim(Sequence) for grouping on classes of isozymes
• Suitability of mappings for the used grouping approaches
  • spkw2go: too general, e.g. 'Glycolysis'
  • ec2go: specific enough, e.g. '6-phosphofructokinase activity'
Summary and future work
• Motivated the need for environments that support the development and evaluation of similarity-based grouping procedures
• Proposed a method that identifies the main components and steps that are important for such environments
• Illustrated the grouping method by test cases based on different strategies and classifications
• Future work: extend the Kitega implementation