Large-Scale Data Fusion by Collective Matrix Factorization. Marinka Žitnik & Blaž Zupan, University of Ljubljana, Slovenia. Tutorial at BC^2, Basel 2015
Tutorial Overview ❖ Motivation: search for bacterial-response genes, a case study ❖ Warm up: recommender systems ❖ Data fusion: tri-factorization and sharing of latent features ❖ Examples: movies & genes ❖ Hands-on: visual programming and scripting ❖ Applications: case studies in prioritization and classification ❖ Other approaches: related work
BC2 Tutorial on Large-scale data fusion by collective matrix factorization. Motivation: Search for bacterial response genes in a social amoeba, Dictyostelium discoideum.
Motivation: Bacterial-Response Genes in Dicty. Adam Kuspa and Gad Shaulsky, Baylor College of Medicine, Houston, USA. Organism: Dictyostelium discoideum
Search for Bacterial Response Genes. Dicty is a bacterial predator! Genetic screen: 50,000 clonal mutants. Genome: 12,000 genes. Found: 7 genes. Gram+ defective: swp1, gpi, nagB1. Gram- defective: clkB, spc3, alyL, nip7. Workload: 5 years. Estimated: ~200 genes.
Now What? 50% coverage (100 genes) 20 screens required! 80% coverage (160 genes) 65 screens required!
Alternative: A Data-Driven Approach. [Figure: Genes × Phenotypes relation built from mutant data. Genes: nagB1, swp1, shkA, spc3, nip7, alyL, kif9, gpi. Phenotypes: Gram neg. defective, Aberrant spore color, Decreased chemotaxis, Gram pos. defective.]
So Much More Data … [Figure: data sources linked to Genes (spc3, swp1, kif9, alyL, nagB1, gpi, shkA, nip7): mutant phenotype data with the Phenotype Ontology, expression data over timepoints, PubMed publication data, and MeSH annotations with the MeSH Ontology over MeSH terms. Phenotypes shown: Decreased chemotaxis, Aberrant spore color, Gram neg. defective, Gram pos. defective.]
Actually, in Biomedicine, Matrices Abound … Gene Expression. [Figure: normalized counts from experiments, labels V1 to V5 over timepoints T1 to T7, shown next to the corresponding Genes × Timepoints matrix.]
Actually, in Biomedicine, Matrices Abound … Gene Interactions. [Figure: interaction network over the genes gemA, gacT, racM, xacA, racN, racJ, rdiA, racI, shown next to the corresponding gene × gene matrix.]
Actually, in Biomedicine, Matrices Abound … Gene Pathways. [Figure: part of the N-Glycan biosynthesis pathway, shown next to a Genes × Pathways matrix. Genes: alg13, alg7, alg1, alg2, alg14, alg11, dpm1, dpm2, dpm3, alg3, alg12, alg9. Pathways: N-Glycan biosynthesis, Fructose and mannose metabolism, GPI-anchor biosynthesis.]
Actually, in Biomedicine, Matrices Abound … Ontologies. [Figure: part of the Gene Ontology graph, shown next to the corresponding matrix of Gene Ontology terms × Gene Ontology terms. Terms include response to stress, response to external biotic stimulus, response to bacterium, response to other organism, defense response, and defense response to bacterium.]
Actually, in Biomedicine, Matrices Abound … Ontologies of Controlled Vocabularies. [Figure: Medical Subject Headings; Literature × MeSH terms. Example MeSH terms: Cell separation; Cytoplasmic vesicles/metabolism; Ethidium/metabolism; Immunity/innate; Mutation; Phagocytes/cytology; Phagocytes/immunology*; Phagocytosis*.]
Actually, in Biomedicine, Matrices Abound … Cross-Links of Controlled Entities. [Figure: links between genes (alg13, alg7, alg1, alg14, dpm1, dpm2, dpm3), pathways (part of the N-Glycan biosynthesis pathway; Fructose and mannose metabolism), ontology terms (Protein N-linked glycosylation, GO:0006487), and orthology entries: Dolichol kinase (K00902), GO:0004168; Alpha-mannosidase II (K01231), GO:0004572; Oligosaccharyltransferase complex (K12668), GO:0008250.]
Warm-Up: Working with matrices: recommender systems, two-factorization and tri-factorization.
Back to the Problem: Recommender Systems. [Figure: two analogous relation matrices. Genes (mutants nagB1, swp1, shkA, spc3, nip7, alyL, kif9, gpi) × Phenotypes (Gram neg. defective, Aberrant spore color, Decreased chemotaxis, Gram pos. defective), and Users (Morgan, Jenny, Jerry, John, Mike, Alex, Kate, Nick) × Movies (Passengers, War of the Worlds, Bride Wars, The Matrix Reloaded, The Godfather, The Dark Knight, Pulp Fiction, Schindler’s List).]
Recommender Systems
Netflix Prize, 2009
Example: A Small Relation Matrix. [Table: User × Movie ratings; most cells are empty. Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars. Known ratings per user: John 2, 3; Kate 5, 4; Alex 4, 5; Mike 4, 5.]
Matrix Two-Factorization
Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars
User factor (columns L1, L2): John 0.2 0; Kate 0 0.5; Alex 0.5 0; Mike 0 0.5
Movie factor (one column per movie): L1 6.3 0 1.1 8; L2 3.9 10.7 0 3.3
Matrix Two-Factorization
Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars
Known ratings: John 2, 3; Kate 5, 4; Alex 4, 5; Mike 4, 5
Ratings matrix ≈ product of the two factors: John 1.3 0 0.2 1.6; Kate 2 5.4 0 1.7; Alex 3.2 0 0.6 4; Mike 2 5.4 0 1.7
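The reconstruction on this slide can be checked by multiplying the two factors from the previous slide. The following is a minimal sketch in NumPy, assuming the user factor is a 4 × 2 matrix (columns L1, L2) and the movie factor is a 2 × 4 matrix (rows L1, L2); the variable names are illustrative and not part of the tutorial.

```python
import numpy as np

users = ["John", "Kate", "Alex", "Mike"]

# User factor as read from the slide: one row per user, columns L1 and L2.
U = np.array([[0.2, 0.0],
              [0.0, 0.5],
              [0.5, 0.0],
              [0.0, 0.5]])

# Movie factor as read from the slide: rows L1 and L2, one column per movie.
M = np.array([[6.3, 0.0, 1.1, 8.0],
              [3.9, 10.7, 0.0, 3.3]])

# Two-factorization: the ratings matrix is approximated by the product U M.
R_hat = U @ M

# Reconstructed values as printed on the slide (rounded to one decimal).
R_slide = np.array([[1.3, 0.0, 0.2, 1.6],
                    [2.0, 5.4, 0.0, 1.7],
                    [3.2, 0.0, 0.6, 4.0],
                    [2.0, 5.4, 0.0, 1.7]])

for name, row in zip(users, R_hat):
    print(name, np.round(row, 2))
print("max abs difference from slide:", np.abs(R_hat - R_slide).max())
```

Each user loads on a single latent factor here, so every reconstructed row is a scaled copy of one row of the movie factor. That is why Kate and Mike receive identical predictions.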
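Matrix Tri-Factorization
Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars
User factor (columns U1, U2): John 0.2 0.3; Kate 0.8 0.2; Alex 0.7 1.2; Mike 0.8 1.2
Middle factor (rows U1, U2; columns M1, M2): U1 -4.4 9.1; U2 6.7 -5.8
Movie factor (one column per movie): M1 0.9 0.2 0.2 1; M2 0.6 0.8 0.1 0.7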
Matrix Tri-Factorization
Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars
Known ratings: John 2, 3; Kate 5, 4; Alex 4, 5; Mike 4, 5
Ratings matrix ≈ product of the three factors: John 1.1 0.3 0.2 1.2; Kate 1.7 4.5 0.2 2.1; Alex 4.1 0.5 0.9 4.5; Mike 1.5 4.8 0.1 1.9
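A tri-factorization approximates the ratings matrix by a product of three matrices: a user factor, a small middle matrix, and a movie factor. The sketch below multiplies the factor values shown on the previous slide for three of the users (John, Kate, Alex). The assignment of numbers to factors is an assumed reading of that slide, and the variable names are illustrative.

```python
import numpy as np

users = ["John", "Kate", "Alex"]

# User factor: one row per user, columns U1 and U2 (assumed reading of the slide).
G_users = np.array([[0.2, 0.3],
                    [0.8, 0.2],
                    [0.7, 1.2]])

# Middle factor: rows U1, U2 and columns M1, M2. Entries may be negative.
S = np.array([[-4.4, 9.1],
              [6.7, -5.8]])

# Movie factor: one row per movie, columns M1 and M2.
G_movies = np.array([[0.9, 0.6],
                     [0.2, 0.8],
                     [0.2, 0.1],
                     [1.0, 0.7]])

# Tri-factorization: R is approximated by G_users S G_movies^T.
R_hat = G_users @ S @ G_movies.T

# Reconstructed rows printed on the slide for the same three users.
R_slide = np.array([[1.1, 0.3, 0.2, 1.2],
                    [1.7, 4.5, 0.2, 2.1],
                    [4.1, 0.5, 0.9, 4.5]])

for name, row in zip(users, R_hat):
    print(name, np.round(row, 2))
print("max abs difference from slide:", np.abs(R_hat - R_slide).max())
```

Unlike two-factorization, the middle matrix lets users and movies have their own latent spaces, which need not have the same number of dimensions.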
Hands-On: Matrix Tri-Factorization
Data Fusion: Collective matrix tri-factorization. Latent factor sharing.
Data Fusion Graph: One Data Matrix. Object types A and B are linked by the data matrix A-B. Tri-factorization of matrix A-B: (recipe matrix of A) × (backbone matrix of A-B) × (recipe matrix of B) = reconstructed matrix A-B ≈ matrix A-B.
Data Fusion Graph: Two Data Matrices. Object type A is linked to B by matrix A-B and to E by matrix A-E. Collective tri-factorization of matrices A-B and A-E: (recipe matrix of A) × (backbone matrix of A-B) × (recipe matrix of B) = reconstructed matrix A-B ≈ matrix A-B; (recipe matrix of A) × (backbone matrix of A-E) × (recipe matrix of E) = reconstructed matrix A-E ≈ matrix A-E.
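In symbols, the two factorizations on this slide can be written as below. The mapping of recipe matrices to $G$ and backbone matrices to $S$ is an assumption, chosen to match the $G_i S_{ij} G_j^T$ notation of the objective function later in the tutorial.

```latex
R_{AB} \approx G_A \, S_{AB} \, G_B^{T},
\qquad
R_{AE} \approx G_A \, S_{AE} \, G_E^{T}
```

The same recipe matrix $G_A$ appears in both products. It has to explain A-B and A-E at once, and this is how information from one data matrix can influence the reconstruction of the other.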
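Data Fusion Graph: All Together Now. [Figure: fusion graph over object types A, B, C, D, E, F, G.]
Sharing of Latent Matrices. [Figure: the same fusion graph over object types A, B, C, D, E, F, G.]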
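Block matrix of relations and block-diagonal constraint matrices:
\[
R = \begin{bmatrix}
* & R_{12} & \cdots & R_{1r} \\
R_{21} & * & \cdots & R_{2r} \\
\vdots & \vdots & \ddots & \vdots \\
R_{r1} & R_{r2} & \cdots & *
\end{bmatrix},
\qquad
\Theta^{(t)} = \mathrm{Diag}\left(\Theta_1^{(t)}, \Theta_2^{(t)}, \ldots, \Theta_r^{(t)}\right)
\]
Matrix factors:
\[
G = \mathrm{Diag}\left(G_1^{n_1 \times k_1}, G_2^{n_2 \times k_2}, \ldots, G_r^{n_r \times k_r}\right),
\qquad
S = \begin{bmatrix}
* & S_{12}^{k_1 \times k_2} & \cdots & S_{1r}^{k_1 \times k_r} \\
S_{21}^{k_2 \times k_1} & * & \cdots & S_{2r}^{k_2 \times k_r} \\
\vdots & \vdots & \ddots & \vdots \\
S_{r1}^{k_r \times k_1} & S_{r2}^{k_r \times k_2} & \cdots & *
\end{bmatrix}
\]
Objective function:
\[
\min_{G \ge 0} J(G; S) = \sum_{R_{ij} \in \mathcal{R}} \left\| R_{ij} - G_i S_{ij} G_j^T \right\|^2 + \sum_{t=1}^{\max_i t_i} \mathrm{tr}\left(G^T \Theta^{(t)} G\right)
\]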
Input: A set $\mathcal{R}$ of relation matrices $R_{ij}$; constraint matrices $\Theta^{(t)}$ for $t \in \{1, 2, \ldots, \max_i t_i\}$; ranks $k_1, k_2, \ldots, k_r$ ($i, j \in [r]$).
Output: Matrix factors $S$ and $G$.
1) Initialize $G_i$ for $i = 1, 2, \ldots, r$.
2) Repeat until convergence:
• Construct $R$ and $G$ using their definitions in Eq. (1) and Eq. (3).
• Update $S$ using:
\[ S \leftarrow (G^T G)^{-1} G^T R G (G^T G)^{-1}. \tag{8} \]
• Set $G_i^{(e)} \leftarrow 0$ for $i = 1, 2, \ldots, r$.
• Set $G_i^{(d)} \leftarrow 0$ for $i = 1, 2, \ldots, r$.
• For $R_{ij} \in \mathcal{R}$:
\[
\begin{aligned}
G_i^{(e)} &\mathrel{+}= (R_{ij} G_j S_{ij}^T)^+ + G_i (S_{ij} G_j^T G_j S_{ij}^T)^- \\
G_i^{(d)} &\mathrel{+}= (R_{ij} G_j S_{ij}^T)^- + G_i (S_{ij} G_j^T G_j S_{ij}^T)^+ \\
G_j^{(e)} &\mathrel{+}= (R_{ij}^T G_i S_{ij})^+ + G_j (S_{ij}^T G_i^T G_i S_{ij})^- \\
G_j^{(d)} &\mathrel{+}= (R_{ij}^T G_i S_{ij})^- + G_j (S_{ij}^T G_i^T G_i S_{ij})^+
\end{aligned}
\tag{10}
\]
• For $t = 1, 2, \ldots, \max_i t_i$:
\[
\begin{aligned}
G_i^{(e)} &\mathrel{+}= [\Theta_i^{(t)}]^- G_i \quad \text{for } i = 1, 2, \ldots, r \\
G_i^{(d)} &\mathrel{+}= [\Theta_i^{(t)}]^+ G_i \quad \text{for } i = 1, 2, \ldots, r
\end{aligned}
\tag{11}
\]
• Construct $G$ as:
\[
G \leftarrow G \circ \mathrm{Diag}\left( \sqrt{\frac{G_1^{(e)}}{G_1^{(d)}}}, \sqrt{\frac{G_2^{(e)}}{G_2^{(d)}}}, \ldots, \sqrt{\frac{G_r^{(e)}}{G_r^{(d)}}} \right), \tag{12}
\]
where $\circ$ denotes the Hadamard product. The $\sqrt{\cdot}$ and $\frac{\cdot}{\cdot}$ are entry-wise operations.
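The update rules above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: relations and factors are kept in dictionaries keyed by object-type names, the S update is written block by block (valid because $G$ is block-diagonal), initialization is uniform random, a fixed iteration count replaces the convergence test, and a small `eps` guards the division. All function and variable names are hypothetical.

```python
import numpy as np

def pos(A):
    """Entry-wise positive part [A]^+."""
    return (np.abs(A) + A) / 2.0

def neg(A):
    """Entry-wise negative part [A]^-, so that A = [A]^+ - [A]^-."""
    return (np.abs(A) - A) / 2.0

def total_error(R, G, S):
    """Sum of squared reconstruction errors over all relation matrices."""
    return sum(np.linalg.norm(Rij - G[i] @ S[i, j] @ G[j].T) ** 2
               for (i, j), Rij in R.items())

def collective_trifactorization(R, ranks, thetas=None, n_iter=200, seed=0, eps=1e-12):
    """R: {(i, j): R_ij}; ranks: {i: k_i}; thetas: {i: [Theta_i^(t), ...]}."""
    rng = np.random.default_rng(seed)
    thetas = thetas or {}
    sizes = {}
    for (i, j), Rij in R.items():
        sizes[i], sizes[j] = Rij.shape
    # Step 1: random non-negative initialization (one simple choice, assumed here).
    G = {i: rng.random((n, ranks[i])) for i, n in sizes.items()}
    S = {}
    for _ in range(n_iter):  # fixed number of sweeps instead of a convergence test
        # Eq. (8), block by block: S_ij = (Gi^T Gi)^-1 Gi^T R_ij Gj (Gj^T Gj)^-1.
        inv = {i: np.linalg.pinv(G[i].T @ G[i]) for i in G}
        for (i, j), Rij in R.items():
            S[i, j] = inv[i] @ G[i].T @ Rij @ G[j] @ inv[j]
        Ge = {i: np.zeros_like(G[i]) for i in G}
        Gd = {i: np.zeros_like(G[i]) for i in G}
        # Eq. (10): accumulate contributions of every relation matrix.
        for (i, j), Rij in R.items():
            Sij = S[i, j]
            A = Rij @ G[j] @ Sij.T
            B = Sij @ G[j].T @ G[j] @ Sij.T
            Ge[i] += pos(A) + G[i] @ neg(B)
            Gd[i] += neg(A) + G[i] @ pos(B)
            C = Rij.T @ G[i] @ Sij
            D = Sij.T @ G[i].T @ G[i] @ Sij
            Ge[j] += pos(C) + G[j] @ neg(D)
            Gd[j] += neg(C) + G[j] @ pos(D)
        # Eq. (11): accumulate contributions of constraint matrices, if any.
        for i, mats in thetas.items():
            for Th in mats:
                Ge[i] += neg(Th) @ G[i]
                Gd[i] += pos(Th) @ G[i]
        # Eq. (12): entry-wise multiplicative update keeps every G_i non-negative.
        for i in G:
            G[i] = G[i] * np.sqrt(Ge[i] / (Gd[i] + eps))
    return G, S

# Toy usage: two relations, A-B and A-E, that share the object type A.
rng = np.random.default_rng(1)
A0, B0, E0 = rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
R = {("A", "B"): A0 @ B0.T, ("A", "E"): A0 @ E0.T}
ranks = {"A": 2, "B": 2, "E": 2}
G, S = collective_trifactorization(R, ranks)
print("squared error after fitting:", total_error(R, G, S))
```

Because `G["A"]` receives updates from both relations, it is the shared latent factor through which A-B and A-E inform each other.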
Hands-On: Collective Matrix Factorization
Scoring: Data sampling. Completion scoring.
Data Sampling & Completion
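One common way to score matrix completion is to hide a random sample of known entries, fit a model on the rest, and measure the error on the hidden entries. The sketch below assumes this protocol and RMSE as the score; the tutorial's exact sampling scheme and scoring function are not given on this slide. The helper names are hypothetical, and a global-mean fill stands in for a factorization model.

```python
import numpy as np

def sample_holdout(shape, frac=0.2, seed=0):
    """Boolean mask marking a random fraction of entries as hidden."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < frac

def completion_rmse(R, R_hat, mask):
    """Root mean squared error computed on the hidden entries only."""
    diff = (R - R_hat)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy low-rank relation matrix.
rng = np.random.default_rng(42)
R = rng.random((8, 3)) @ rng.random((3, 6))

# Hide roughly a quarter of the entries.
mask = sample_holdout(R.shape, frac=0.25, seed=1)

# Stand-in completion: fill hidden entries with the mean of the visible ones.
# A fitted factorization would supply R_hat here instead.
R_hat = np.where(mask, R[~mask].mean(), R)

print("hidden entries:", int(mask.sum()))
print("hold-out RMSE of the mean baseline:", completion_rmse(R, R_hat, mask))
```

Scoring only the hidden entries matters: error on entries the model was fitted to says little about how well it completes missing ones.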
Hands-On: Performance Evaluation
Hands-On: Latent Profiling
The Yeast Case Study: A large data compendium. Functional genomics.
The Yeast Case Study. [Figure: fusion graph over the object types Gene, Literature, Literature Topic, Experiment, Gene Ontology Term, and Biochemical pathway. Data sources on the edges: topics of yeast biology; literature on functions and processes (gene literature); genetic and physical interactions; ontology structure; gene expression data; gene annotation data; gene pathway data.]
Hands-On: The Yeast Case Study
Latent Matrix Chaining. [Figure: part of the yeast fusion graph: Gene, Literature (gene literature), Literature Topic (topics of yeast biology). Chain of latent matrices: (Gene) × (Gene to Literature) × (Literature to Literature Topic) = gene profile matrix.]
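Chaining can be read as a sequence of matrix products along a path in the fusion graph. The sketch below assumes the chain starts with the latent matrix of genes and continues with the two middle matrices along the path Gene, Literature, Literature Topic. The random matrices, their sizes, and the variable names are placeholders for factors that a fitted model would provide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: number of genes and latent ranks of the three object types.
n_genes, k_gene, k_lit, k_topic = 5, 3, 4, 2

# Stand-ins for fitted latent matrices (assumed shapes).
G_gene = rng.random((n_genes, k_gene))      # genes in their own latent space
S_gene_lit = rng.random((k_gene, k_lit))    # gene space to literature space
S_lit_topic = rng.random((k_lit, k_topic))  # literature space to topic space

# Chain the latent matrices along the path Gene -> Literature -> Literature Topic.
gene_profiles = G_gene @ S_gene_lit @ S_lit_topic

print(gene_profiles.shape)  # one row per gene, one column per topic latent factor
```

Each row of `gene_profiles` describes a gene in the latent space of literature topics, even though no data matrix links genes to topics directly.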
Hands-On: Latent Matrix Chaining