large scale data fusion

Large-Scale Data Fusion by Collective Matrix Factorization: Tutorial at BC^2, Basel 2015

Marinka Žitnik & Blaž Zupan, University of Ljubljana, Slovenia. Large-Scale Data Fusion by Collective Matrix Factorization, tutorial at BC^2, Basel 2015. Tutorial overview. Motivation: search for bacterial-response genes, a case study.


  1. Marinka Žitnik & Blaž Zupan, University of Ljubljana, Slovenia. Large-Scale Data Fusion by Collective Matrix Factorization. Tutorial at BC^2, Basel 2015.

  2. Tutorial Overview
     ❖ Motivation: search for bacterial-response genes, a case study
     ❖ Warm up: recommender systems
     ❖ Data fusion: tri-factorization and sharing of latent features
     ❖ Examples: movies & genes
     ❖ Hands-on: visual programming and scripting
     ❖ Applications: case studies in prioritization and classification
     ❖ Other approaches: related work

  3. BC2 Tutorial on Large-scale data fusion by collective matrix factorization. Motivation: Search for bacterial-response genes in the social amoeba Dictyostelium discoideum.

  4. Motivation: Bacterial-Response Genes in Dicty. Adam Kuspa and Gad Shaulsky, Baylor College of Medicine, Houston, USA. [Figure: Dictyostelium discoideum]

  5. Search for Bacterial Response Genes. Dicty is a bacterial predator!
     Genetic screen: 50,000 clonal mutants. Genome: 12,000 genes. Workload: 5 years. Found: 7 genes. Estimated: ~200 genes.
     Gram+ defective: swp1, gpi, nagB1.
     Gram- defective: clkB, spc3, alyL, nip7.

  6. Now What? 50% coverage (100 genes): 20 screens required! 80% coverage (160 genes): 65 screens required!

  7. Alternative: A Data-Driven Approach. [Figure: gene-by-phenotype matrix of mutant phenotypes. Genes: nagB1, swp1, shkA, spc3, nip7, alyL, kif9, gpi. Phenotypes: Gram neg. defective, Aberrant spore color, Decreased chemotaxis, Gram pos. defective.]

  8. So Much More Data … [Figure: the gene-by-phenotype matrix from the previous slide placed in a graph of further data sets. Object types: Genes, Mutant Phenotypes, Timepoints, Publications, MeSH terms. Data sets: phenotype data, Phenotype Ontology, expression data, PubMed data, MeSH annotations, MeSH Ontology.]

  9. Actually, in Biomedicine, Matrices Abound … Gene Expression. [Figure: gene-by-timepoint matrices of normalized counts, timepoints T1-T7, one matrix per experiment V1-V5.]

  10. Actually, in Biomedicine, Matrices Abound … Gene Interactions. [Figure: interaction network and the corresponding gene-by-gene matrix over gemA, gacT, racM, xacA, racN, racJ, rdiA, racI.]

  11. Actually, in Biomedicine, Matrices Abound … Gene Pathways. [Figure: part of the N-Glycan biosynthesis pathway and the corresponding gene-by-pathway matrix. Genes: alg13, alg7, alg1, alg2, alg14, alg11, dpm1, dpm2, dpm3, alg3, alg12, alg9. Pathways: N-Glycan biosynthesis, Fructose and mannose metabolism, GPI-anchor biosynthesis.]

  12. Actually, in Biomedicine, Matrices Abound … Ontologies. [Figure: part of the Gene Ontology graph and the corresponding term-by-term matrix. Gene Ontology terms: Response to stress, Response to external stimulus, Response to biotic stimulus, Defense response, Response to bacterium, Response to other organism, Defense response to bacterium.]

  13. Actually, in Biomedicine, Matrices Abound … Ontologies of Controlled Vocabularies. [Figure: Medical Subject Headings; literature-by-MeSH-term matrix. MeSH terms: Cell separation, Cytoplasmic vesicles/metabolism, Ethidium/metabolism, Immunity/innate, Mutation, Phagocytes/cytology, Phagocytes/immunology*, Phagocytosis*.]

  14. Actually, in Biomedicine, Matrices Abound … Cross-Links of Controlled Entities. [Figure: links among genes (alg13, alg7, alg1, alg14, dpm1, dpm2, dpm3), pathways (N-Glycan biosynthesis, Fructose and mannose metabolism), ontology terms and orthology groups. Ontology: Protein N-linked glycosylation (GO:0006487). Orthology to ontology: Dolichol kinase (K00902) to GO:0004168; Alpha-mannosidase II (K01231) to GO:0004572; Oligosaccharyltransferase complex (K12668) to GO:0008250.]

  15. Warm-Up: Working with matrices: recommender systems, two-factorization and tri-factorization.

  16. Back to the Problem: Recommender Systems. [Figure: the gene-by-phenotype matrix (genes nagB1, swp1, shkA, spc3, nip7, alyL, kif9, gpi; phenotypes Gram neg. defective, Aberrant spore color, Decreased chemotaxis, Gram pos. defective) shown next to a user-by-movie matrix. Users: Morgan, Jenny, Jerry, John, Mike, Alex, Kate, Nick. Movies: Passengers, War of the Worlds, Bride Wars, The Matrix Reloaded, The Godfather, The Dark Knight, Pulp Fiction, Schindler's List.]

  17. Recommender Systems

  18. Netflix Prize, 2009

  19. Example: A Small Relation Matrix. [Table: user-by-movie rating matrix with most entries missing. Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars. Observed ratings per user (column positions not recoverable): John 2, 3; Kate 5, 4; Alex 4, 5; Mike 4, 5.]

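  20. Matrix Two-Factorization. The user-by-movie matrix is approximated by the product of a user factor and a movie factor with two latent dimensions, L1 and L2.

     User factor (users x latent):
              L1    L2
     John     0.2   0
     Kate     0     0.5
     Alex     0.5   0
     Mike     0     0.5

     Movie factor (latent x movies, one column per movie):
     L1       6.3   0      1.1   8
     L2       3.9   10.7   0     3.3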

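  21. Matrix Two-Factorization. Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars. The observed ratings (John 2, 3; Kate 5, 4; Alex 4, 5; Mike 4, 5) are approximated by the product of the two factors, which also fills in the missing entries:

     John     1.3   0     0.2   1.6
     Kate     2     5.4   0     1.7
     Alex     3.2   0     0.6   4
     Mike     2     5.4   0     1.7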
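The two-factorization above can be checked numerically. The following is a minimal sketch, assuming the factor values are read from the preceding slide as a 4 x 2 user factor and a 2 x 4 movie factor (the slide shows rounded numbers, so the product only agrees to about one decimal). The variable names are illustrative, not taken from the tutorial.

```python
import numpy as np

# Factor values as read from the two-factorization slide (rounded there).
users = ["John", "Kate", "Alex", "Mike"]
U = np.array([[0.2, 0.0],
              [0.0, 0.5],
              [0.5, 0.0],
              [0.0, 0.5]])              # users x latent (L1, L2)
M = np.array([[6.3, 0.0, 1.1, 8.0],
              [3.9, 10.7, 0.0, 3.3]])   # latent (L1, L2) x movies

# Reconstructed rating matrix: every user-movie pair gets a score.
approx = U @ M

# Reconstruction as printed on the following slide.
slide = np.array([[1.3, 0.0, 0.2, 1.6],
                  [2.0, 5.4, 0.0, 1.7],
                  [3.2, 0.0, 0.6, 4.0],
                  [2.0, 5.4, 0.0, 1.7]])

for name, row in zip(users, approx):
    print(name, np.round(row, 2))

# Largest absolute difference between our product and the slide values.
max_dev = float(np.abs(approx - slide).max())
print("max deviation from slide:", round(max_dev, 3))
```

Each user has a single non-zero latent weight in this example, so John and Alex share one movie profile (L1) while Kate and Mike share the other (L2).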

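  22. Matrix Tri-Factorization. The user-by-movie matrix is approximated by the product of a user factor (latent dimensions U1, U2), a small middle matrix relating user and movie latent dimensions, and a movie factor (latent dimensions M1, M2).

     User factor (users x latent):
              U1    U2
     John     0.2   0.3
     Kate     0.8   0.2
     Alex     0.7   1.2
     Mike     0.8   1.2

     Middle matrix (user latent x movie latent):
              M1    M2
     U1      -4.4   9.1
     U2       6.7  -5.8

     Movie factor (latent x movies, one column per movie):
     M1       0.9   0.2   0.2   1
     M2       0.6   0.8   0.1   0.7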

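  23. Matrix Tri-Factorization. Movies: The Matrix Reloaded, War of the Worlds, Passengers, Bride Wars. The observed ratings (John 2, 3; Kate 5, 4; Alex 4, 5; Mike 4, 5) are approximated by the product of the three factors:

     John     1.1   0.3   0.2   1.2
     Kate     1.7   4.5   0.2   2.1
     Alex     4.1   0.5   0.9   4.5
     Mike     1.5   4.8   0.1   1.9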
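A tri-factorization reconstructs the matrix as a product of three factors. The sketch below assumes the factor values as they can be recovered from the preceding slide: user rows for John, Kate and Alex, a 2 x 2 middle matrix, and a movie factor with latent dimensions M1 and M2. Mike's row is left out because the factor entries recoverable from the transcript for him do not reproduce his reconstructed row. Variable names are illustrative.

```python
import numpy as np

# User factor rows as recovered from the slide (columns U1, U2).
G_users = np.array([[0.2, 0.3],    # John
                    [0.8, 0.2],    # Kate
                    [0.7, 1.2]])   # Alex

# Middle matrix relating user latent dimensions to movie latent dimensions.
S = np.array([[-4.4, 9.1],
              [6.7, -5.8]])

# Movie factor, one row per movie (columns M1, M2).
G_movies = np.array([[0.9, 0.6],
                     [0.2, 0.8],
                     [0.2, 0.1],
                     [1.0, 0.7]])

# Tri-factorization: users x movies = G_users * S * G_movies^T.
approx = G_users @ S @ G_movies.T

# Reconstructed rows for John, Kate and Alex as printed on the slide.
slide = np.array([[1.1, 0.3, 0.2, 1.2],
                  [1.7, 4.5, 0.2, 2.1],
                  [4.1, 0.5, 0.9, 4.5]])

print(np.round(approx, 2))
max_dev = float(np.abs(approx - slide).max())
print("max deviation from slide:", round(max_dev, 3))
```

Unlike the two-factorization, the middle matrix may contain negative entries, and users and movies may use different numbers of latent dimensions.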

  24. Hands-On: Matrix Tri-Factorization

  25. Data Fusion: Collective matrix tri-factorization. Latent factor sharing.

  26. Data Fusion Graph: One Data Matrix. [Figure: fusion graph with object types A and B connected by the relation matrix A-B.] Tri-factorization of matrix A-B: recipe matrix of A x backbone matrix of A-B x recipe matrix of B ≈ reconstructed matrix A-B.

  27. Data Fusion Graph: Two Data Matrices. [Figure: fusion graph with object types A, B and E; A is connected to B by matrix A-B and to E by matrix A-E.] Collective tri-factorization of matrices A-B and A-E:
     recipe matrix of A x backbone matrix of A-B x recipe matrix of B ≈ reconstructed matrix A-B
     recipe matrix of A x backbone matrix of A-E x recipe matrix of E ≈ reconstructed matrix A-E
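The sharing of the recipe matrix of A across both relations can be illustrated with shapes alone. This is a minimal sketch with randomly generated factors; the object counts, ranks and variable names are hypothetical and only serve to show that one matrix `G_a` takes part in both reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers of objects of types A, B, E and their latent ranks.
n_a, n_b, n_e = 6, 5, 4
k_a, k_b, k_e = 3, 2, 2

# One recipe matrix per object type. G_a is shared by both relations.
G_a = rng.random((n_a, k_a))
G_b = rng.random((n_b, k_b))
G_e = rng.random((n_e, k_e))

# One backbone matrix per relation.
S_ab = rng.random((k_a, k_b))
S_ae = rng.random((k_a, k_e))

# Both reconstructions reuse the same G_a.
R_ab_hat = G_a @ S_ab @ G_b.T   # shape (n_a, n_b)
R_ae_hat = G_a @ S_ae @ G_e.T   # shape (n_a, n_e)

print(R_ab_hat.shape, R_ae_hat.shape)
```

Because `G_a` appears in both products, fitting it against A-B and A-E at the same time lets information from one data set influence the reconstruction of the other.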

  28. Data Fusion Graph: All Together Now. [Figure: fusion graph over object types A, B, C, D, E, F, G.]

  29. Sharing of Latent Matrices. [Figure: fusion graph over object types A, B, C, D, E, F, G.]

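  30. $$\mathbf{R} = \begin{bmatrix} * & \mathbf{R}_{12} & \cdots & \mathbf{R}_{1r} \\ \mathbf{R}_{21} & * & \cdots & \mathbf{R}_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{R}_{r1} & \mathbf{R}_{r2} & \cdots & * \end{bmatrix}, \qquad \mathbf{\Theta}^{(t)} = \mathrm{Diag}\big(\mathbf{\Theta}_1^{(t)}, \mathbf{\Theta}_2^{(t)}, \ldots, \mathbf{\Theta}_r^{(t)}\big)$$

     $$\mathbf{G} = \mathrm{Diag}\big(\mathbf{G}_1^{n_1 \times k_1}, \mathbf{G}_2^{n_2 \times k_2}, \ldots, \mathbf{G}_r^{n_r \times k_r}\big), \qquad \mathbf{S} = \begin{bmatrix} * & \mathbf{S}_{12}^{k_1 \times k_2} & \cdots & \mathbf{S}_{1r}^{k_1 \times k_r} \\ \mathbf{S}_{21}^{k_2 \times k_1} & * & \cdots & \mathbf{S}_{2r}^{k_2 \times k_r} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{S}_{r1}^{k_r \times k_1} & \mathbf{S}_{r2}^{k_r \times k_2} & \cdots & * \end{bmatrix}$$

     $$\min_{\mathbf{G} \geq 0} J(\mathbf{G}; \mathbf{S}) = \sum_{\mathbf{R}_{ij} \in \mathcal{R}} \big\lVert \mathbf{R}_{ij} - \mathbf{G}_i \mathbf{S}_{ij} \mathbf{G}_j^T \big\rVert^2 + \sum_{t=1}^{\max_i t_i} \mathrm{tr}\big(\mathbf{G}^T \mathbf{\Theta}^{(t)} \mathbf{G}\big)$$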

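  31. Input: A set $\mathcal{R}$ of relation matrices $\mathbf{R}_{ij}$; constraint matrices $\mathbf{\Theta}^{(t)}$ for $t \in \{1, 2, \ldots, \max_i t_i\}$; ranks $k_1, k_2, \ldots, k_r$ ($i, j \in [r]$).
     Output: Matrix factors $\mathbf{S}$ and $\mathbf{G}$.
     1) Initialize $\mathbf{G}_i$ for $i = 1, 2, \ldots, r$.
     2) Repeat until convergence:
        • Construct $\mathbf{R}$ and $\mathbf{G}$ using their definitions in Eq. (1) and Eq. (3).
        • Update $\mathbf{S}$ using:
          $$\mathbf{S} \leftarrow (\mathbf{G}^T \mathbf{G})^{-1} \mathbf{G}^T \mathbf{R} \mathbf{G} (\mathbf{G}^T \mathbf{G})^{-1} \tag{8}$$
        • Set $\mathbf{G}_i^{(e)} \leftarrow \mathbf{0}$ for $i = 1, 2, \ldots, r$.
        • Set $\mathbf{G}_i^{(d)} \leftarrow \mathbf{0}$ for $i = 1, 2, \ldots, r$.
        • For $\mathbf{R}_{ij} \in \mathcal{R}$:
          $$\begin{aligned} \mathbf{G}_i^{(e)} &\mathrel{+}= (\mathbf{R}_{ij} \mathbf{G}_j \mathbf{S}_{ij}^T)^+ + \mathbf{G}_i (\mathbf{S}_{ij} \mathbf{G}_j^T \mathbf{G}_j \mathbf{S}_{ij}^T)^- \\ \mathbf{G}_i^{(d)} &\mathrel{+}= (\mathbf{R}_{ij} \mathbf{G}_j \mathbf{S}_{ij}^T)^- + \mathbf{G}_i (\mathbf{S}_{ij} \mathbf{G}_j^T \mathbf{G}_j \mathbf{S}_{ij}^T)^+ \\ \mathbf{G}_j^{(e)} &\mathrel{+}= (\mathbf{R}_{ij}^T \mathbf{G}_i \mathbf{S}_{ij})^+ + \mathbf{G}_j (\mathbf{S}_{ij}^T \mathbf{G}_i^T \mathbf{G}_i \mathbf{S}_{ij})^- \\ \mathbf{G}_j^{(d)} &\mathrel{+}= (\mathbf{R}_{ij}^T \mathbf{G}_i \mathbf{S}_{ij})^- + \mathbf{G}_j (\mathbf{S}_{ij}^T \mathbf{G}_i^T \mathbf{G}_i \mathbf{S}_{ij})^+ \end{aligned} \tag{10}$$
        • For $t = 1, 2, \ldots, \max_i t_i$:
          $$\begin{aligned} \mathbf{G}_i^{(e)} &\mathrel{+}= [\mathbf{\Theta}_i^{(t)}]^- \mathbf{G}_i \quad \text{for } i = 1, 2, \ldots, r \\ \mathbf{G}_i^{(d)} &\mathrel{+}= [\mathbf{\Theta}_i^{(t)}]^+ \mathbf{G}_i \quad \text{for } i = 1, 2, \ldots, r \end{aligned} \tag{11}$$
        • Construct $\mathbf{G}$ as:
          $$\mathbf{G} \leftarrow \mathbf{G} \circ \mathrm{Diag}\left( \sqrt{\frac{\mathbf{G}_1^{(e)}}{\mathbf{G}_1^{(d)}}}, \sqrt{\frac{\mathbf{G}_2^{(e)}}{\mathbf{G}_2^{(d)}}}, \ldots, \sqrt{\frac{\mathbf{G}_r^{(e)}}{\mathbf{G}_r^{(d)}}} \right) \tag{12}$$
          where $\circ$ denotes the Hadamard product. The square root and the division are entry-wise operations.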
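The update rules above can be sketched in NumPy for the simplest case. This is a minimal illustration, not the tutorial's implementation: it assumes a single relation matrix between two object types and no constraint matrices, so Eq. (11) drops out. The function name `tri_factorize`, the helpers `pos` and `neg`, the use of a pseudo-inverse in place of the inverse in Eq. (8), the small `eps` guarding the division, and the fixed iteration count in place of a convergence test are all assumptions made for this sketch.

```python
import numpy as np

def pos(A):
    """Entry-wise positive part, written (.)^+ in Eq. (10)."""
    return (np.abs(A) + A) / 2

def neg(A):
    """Entry-wise negative part, written (.)^- in Eq. (10)."""
    return (np.abs(A) - A) / 2

def solve_S(R, G1, G2):
    """Eq. (8) restricted to one block; pinv guards against singular Gram matrices."""
    return np.linalg.pinv(G1.T @ G1) @ G1.T @ R @ G2 @ np.linalg.pinv(G2.T @ G2)

def tri_factorize(R, k1, k2, n_iter=200, seed=0, eps=1e-9):
    """Approximate R by G1 @ S @ G2.T with non-negative G1 and G2."""
    rng = np.random.default_rng(seed)
    n1, n2 = R.shape
    G1 = rng.random((n1, k1)) + 0.1   # step 1: random non-negative initialization
    G2 = rng.random((n2, k2)) + 0.1
    for _ in range(n_iter):           # fixed iteration count instead of a convergence test
        S = solve_S(R, G1, G2)
        # Enumerator (e) and denominator (d) terms of Eq. (10) for G1.
        A = R @ G2 @ S.T
        B = S @ G2.T @ G2 @ S.T
        G1_e = pos(A) + G1 @ neg(B)
        G1_d = neg(A) + G1 @ pos(B)
        # The same terms for G2, computed from the current (not yet updated) G1.
        C = R.T @ G1 @ S
        D = S.T @ G1.T @ G1 @ S
        G2_e = pos(C) + G2 @ neg(D)
        G2_d = neg(C) + G2 @ pos(D)
        # Eq. (12): multiplicative update keeps the factors non-negative.
        G1 = G1 * np.sqrt(G1_e / (G1_d + eps))
        G2 = G2 * np.sqrt(G2_e / (G2_d + eps))
    return G1, solve_S(R, G1, G2), G2

# Toy non-negative relation matrix of rank 2 (made-up data).
rng = np.random.default_rng(1)
R = rng.random((8, 2)) @ rng.random((2, 6))
G1, S, G2 = tri_factorize(R, k1=2, k2=2)
rel_err = np.linalg.norm(R - G1 @ S @ G2.T) / np.linalg.norm(R)
print("relative reconstruction error:", rel_err)
```

With several relation matrices, the enumerator and denominator terms for each object type would be summed over all relations that involve it before the update of Eq. (12) is applied, which is how the latent factors come to be shared.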

  32. Hands-On: Collective Matrix Factorization

  33. Scoring: Data sampling. Completion scoring.

  34. Data Sampling & Completion
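One common way to score matrix completion is to hide a sample of known entries, fit a model on the rest, and measure the error on the hidden entries. The sketch below illustrates that idea under stated assumptions: the helper names `holdout_mask` and `rmse` are hypothetical, the data are made up, and a rank-3 truncated SVD of the mean-filled matrix stands in for the factorization model, which is not the method of the tutorial.

```python
import numpy as np

def holdout_mask(shape, fraction, seed=0):
    """Boolean mask that is True where an entry is hidden from training."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < fraction

def rmse(R_true, R_hat, mask):
    """Root-mean-square error computed over the hidden entries only."""
    diff = (R_true - R_hat)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up low-rank relation matrix.
rng = np.random.default_rng(0)
R = rng.random((20, 3)) @ rng.random((3, 15))

# Data sampling: hide about 20% of the entries.
mask = holdout_mask(R.shape, fraction=0.2)
observed_mean = R[~mask].mean()
R_train = R.copy()
R_train[mask] = observed_mean        # hidden entries replaced by the observed mean

# Stand-in completion model: rank-3 truncated SVD of the training matrix.
U, s, Vt = np.linalg.svd(R_train, full_matrices=False)
k = 3
R_hat = (U[:, :k] * s[:k]) @ Vt[:k]

# Completion scoring on the hidden entries, next to a constant baseline.
score = rmse(R, R_hat, mask)
baseline = rmse(R, np.full_like(R, observed_mean), mask)
print("model RMSE:", score, "baseline RMSE:", baseline)
```

Repeating the procedure with different seeds gives a spread of scores rather than a single number.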

  35. Hands-On: Performance Evaluation

  36. Hands-On: Latent Profiling

  37. The Yeast Case Study: A large data compendium. Functional genomics.

  38. The Yeast Case Study. [Figure: fusion graph. Object types: Gene, Literature, Literature Topic, Gene Ontology Term, Experiment, Biochemical pathway. Data sets: gene literature, topics of yeast biology, literature on functions and processes, genetic and physical interactions, gene expression data, gene annotation data, ontology structure, gene pathway data.]

  39. Hands-On: The Yeast Case Study

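  40. Latent Matrix Chaining. [Figure: path Gene - Literature - Literature Topic in the fusion graph, through the data sets "gene literature" and "topics of yeast biology".] Chain of latent matrices: the latent matrix of Gene, multiplied by the latent matrices along the path from Gene through Literature to Literature Topic, gives the gene profile matrix.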
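Chaining can be read as a product of latent matrices along a path in the fusion graph. The sketch below is an assumption about what the slide depicts: it multiplies a gene recipe matrix by the backbone matrices of the gene-literature and literature-topic relations to obtain gene profiles in the latent space of literature topics. All sizes, values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of genes and latent ranks of the three object types.
n_genes, k_gene, k_lit, k_topic = 10, 4, 3, 2

# Latent matrices that a collective factorization would provide (random here).
G_gene = rng.random((n_genes, k_gene))     # recipe matrix of Gene
S_gene_lit = rng.random((k_gene, k_lit))   # backbone of Gene - Literature
S_lit_topic = rng.random((k_lit, k_topic)) # backbone of Literature - Literature Topic

# Chain of latent matrices along Gene -> Literature -> Literature Topic.
gene_profiles = G_gene @ S_gene_lit @ S_lit_topic

print(gene_profiles.shape)  # one row per gene, one column per topic latent dimension
```

The resulting rows can be compared or clustered like any other feature vectors, even though genes and literature topics are not directly linked by a data matrix.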

  41. Hands-On: Latent Matrix Chaining
