Graph Theoretic Latent Class Discovery and Its Robustness to Minimal Dominating Set Choice

  1. Graph Theoretic Latent Class Discovery and Its Robustness to Minimal Dominating Set Choice
     J. L. Solka, C. E. Priebe, and D. J. Marchette (jsolka@nswc.navy.mil; dmarche@nswc.navy.mil), NSWCDD
     Interface04 – p.1/24

  2. Agenda
     - What is latent class discovery?
     - What are some approaches to the latent class discovery process?
     - The class cover catch digraph classifier.
     - Latent class discovery results on a gene expression data set.
     - Wrap-up and conclusions.

  3. Acknowledgments
     - Michael C. Minnotte, Jurgen Symanzik, and others for organizing the conference.
     - Office of Naval Research, through their ILIR Program, for funding this effort.

  4. What is Latent Class Discovery?
     - A latent class is a class of observations that resides undiscovered within a known class of observations.
     - Develop a general methodology for the discernment of latent class structure during discriminant analysis.
       - Moderately large hyperdimensional data sets.
       - During training or testing.
     - Explore applications of the developed methodologies to the analysis of data sets in the areas of hyperdimensional image analysis, artificial olfactory systems, computer security data, gene expression data, and text data mining.

  5. Flow Chart
     [Figure: flow chart. Box labels: hyperdimensional data; metric space adaptation; nonlinear dimensionality reduction; multidimensional scaling; discriminant analysis; graph theoretic latent classes; insights.]

  6. Dominating Set
     [Figure: two-class data and covering discs; dominating set.]
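
The covering-disc and dominating-set ideas on this slide can be sketched in code. What follows is a minimal Python sketch, assuming the usual class cover catch digraph construction (each target-class point gets an open ball whose radius is its distance to the nearest point of the other class, and a vertex dominates every target point inside its ball) together with a simple greedy heuristic. The function names `cccd_adjacency` and `greedy_dominating_set` and the toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def cccd_adjacency(target, other):
    """Build a class cover catch digraph on the target-class points.

    Assumed construction: the ball around target point i has radius equal to
    the distance from that point to the nearest other-class point, and there
    is an arc i -> j whenever target point j lies strictly inside that ball.
    """
    target = np.asarray(target, dtype=float)
    other = np.asarray(other, dtype=float)
    # radii[i] = distance from target point i to its nearest other-class point
    radii = np.linalg.norm(target[:, None, :] - other[None, :, :], axis=2).min(axis=1)
    # dists[i, j] = distance between target points i and j
    dists = np.linalg.norm(target[:, None, :] - target[None, :, :], axis=2)
    adj = dists < radii[:, None]
    np.fill_diagonal(adj, True)  # every vertex dominates itself
    return adj, radii

def greedy_dominating_set(adj):
    """Greedy heuristic: repeatedly take the vertex that covers the most
    still-uncovered vertices (ties go to the lowest index)."""
    covered = np.zeros(adj.shape[0], dtype=bool)
    chosen = []
    while not covered.all():
        gains = (adj & ~covered).sum(axis=1)  # new vertices each candidate would cover
        best = int(gains.argmax())
        chosen.append(best)
        covered |= adj[best]
    return chosen

# Toy two-class data (hypothetical): three target points, one other-class point.
adj, radii = cccd_adjacency([[0, 0], [0.5, 0], [5, 0]], [[2, 0]])
print(radii)                       # ball radius of each target point
print(greedy_dominating_set(adj))  # indices of the selected ball centers
```

The greedy choice is only a heuristic: it returns a dominating set, but ties between equally good vertices can be broken in more than one way, which is one reason several distinct solutions can exist for the same digraph.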

  7. CCCD-Based Latent Class Discovery
     [Figure: plot with horizontal axis from −7 to 4 and vertical axis from −6 to 3.]

  8. ALL/AML Leukemia Gene Expression Analysis
     [Figure: analysis flow diagram. Data: 72 patients, 7129 genes. Steps: apply CCCD to the ALL observations; cluster the CCCD solution based on radii; examine clusters for latent class structure; ascertain significance of latent class structure. Legend: AML, ALL B-cell, ALL T-cell.]

  9. Resubstitution Error Rate Estimate
     - For each […], an empirical risk (resubstitution error rate estimate) […] is calculated as […].
     [The symbols and the formula on this slide are not recoverable from the source.]
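
A resubstitution error rate is, in general, the fraction of training observations that a classifier mislabels when it is evaluated on the same data it was fit to. The sketch below illustrates only that general definition in Python; it is not a reconstruction of the slide's formula, and the function name `resubstitution_error` and the toy labels are hypothetical.

```python
import numpy as np

def resubstitution_error(y_true, y_pred):
    """Fraction of training observations whose predicted label is wrong."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("label arrays must have the same shape")
    return float(np.mean(y_true != y_pred))

# Hypothetical labels: the classifier was fit on these same four observations.
train_labels = ["B", "B", "T", "T"]
fitted_labels = ["B", "T", "T", "T"]
print(resubstitution_error(train_labels, fitted_labels))  # 0.25
```

Because the classifier is scored on its own training data, this estimate tends to be optimistic, which is consistent with the deck penalizing it by dimensionality on the next slide.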

  10. Classification Dimension
      - We proceed by defining the “scale dimension” to be the cluster map dimension that minimizes a dimensionality-penalized empirical risk, for some penalty coefficient.
      [The symbols and the formula on this slide are not recoverable from the source.]
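
As a rough illustration of picking a dimension by penalized empirical risk, here is a minimal Python sketch. It assumes a penalty that is linear in the dimension (risk plus `delta` times dimension) and returns the smallest minimizer; the linear form, the name `scale_dimension`, the symbol `delta`, and the risk values are all assumptions made for illustration, since the slide's own formula did not survive.

```python
import numpy as np

def scale_dimension(risks, delta):
    """Return the smallest dimension d (1-based) minimizing risks[d-1] + delta * d.

    risks[d-1] is the empirical risk at cluster map dimension d; delta is the
    assumed penalty coefficient.
    """
    risks = np.asarray(risks, dtype=float)
    dims = np.arange(1, len(risks) + 1)
    penalized = risks + delta * dims
    return int(dims[np.argmin(penalized)])  # argmin returns the first minimizer

# Hypothetical risk curve over dimensions 1..5.
risks = [0.30, 0.10, 0.05, 0.05, 0.04]
print(scale_dimension(risks, delta=0.02))  # 3
print(scale_dimension(risks, delta=0.0))   # 5
```

With no penalty the last small improvement in risk wins and the largest dimension is chosen; a positive penalty trades that improvement against model complexity and selects a lower dimension.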

  11. ALL/AML Classification Dimension Plot

  12. Gene Latent Class Discovery

  13. ALL/AML MDS Plot

  14. How Robust is the Methodology?
      - One other “success” story using artificial nose data.
      - What if we had used another dominating set in our analysis?
      - Is the discovered latent class structure independent of the dominating set used?

  15. An Exhaustive Enumeration of All Possible Dominating Sets for the Gene Data
      - 180 solutions with 21 nodes each.
      - 16 of the nodes remain fixed across the solutions.
      - 14 greedy solutions.
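
To make the enumeration concrete, the following Python sketch lists every minimum-size dominating set of a small digraph by brute force and then intersects the solutions to find the vertices common to all of them. It is a toy-scale illustration only: the slides do not say how the 180 solutions were enumerated, brute force grows exponentially with the number of vertices, and the name `minimum_dominating_sets` and the example digraph are hypothetical.

```python
from itertools import combinations

def minimum_dominating_sets(cover):
    """Enumerate all dominating sets of minimum size.

    cover[v] is the set of vertices dominated by v and is assumed to include
    v itself, so the full vertex set is always a dominating set.
    """
    vertices = sorted(cover)
    everything = set(vertices)
    for size in range(1, len(vertices) + 1):
        found = [
            set(combo)
            for combo in combinations(vertices, size)
            if set().union(*(cover[v] for v in combo)) == everything
        ]
        if found:  # the first size with any solution is the minimum size
            return found
    return []

# Hypothetical digraph: vertices 0 and 1 dominate each other, vertex 2 only itself.
cover = {0: {0, 1}, 1: {0, 1}, 2: {2}}
solutions = minimum_dominating_sets(cover)
fixed = set.intersection(*solutions)  # vertices present in every solution
print(solutions)  # [{0, 2}, {1, 2}]
print(fixed)      # {2}
```

In this toy case vertex 2 plays the role of a node that "remains fixed across the solutions": it is dominated by no other vertex, so every dominating set must contain it.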

  16. Classification Space Curves for the 180 Solutions
      [Figure: curves for the 180 solutions; vertical axis from 0.00 to 0.30, horizontal axis from 5 to 20.]

  17. Classification Dimension for the 180 Solutions (red o: greedy solutions; green *: previous solution)
      [Figure: classification dimension on the vertical axis, from 2 to 7; horizontal axis from 0 to 180.]

  18. Number of Dominating Sets for Each Vertex
      [Figure: number of dominating sets (vertical axis, 0 to 150) for each vertex (horizontal axis, 0 to 40). Legend: T-cell, B-cell, in-degree 0.]

  19. Digraph Analysis
      [Figure: content not recoverable from the source.]

  20. Latent Class Discovery Figures of Merit
      - How can we be assured that all of the greedy dominating set solutions discover the same latent classes?
      - The previous greedy solution had 3 clusters that were pure B and 1 cluster that contained 8/9 of the T observations.
      - Figures of merit: the percentage of B points that are in pure B clusters, and the highest percentage of T points in any one cluster.
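
The two figures of merit described on this slide can be computed as in the Python sketch below. It assumes each observation is assigned to exactly one cluster, which is a simplification (the slides do not say how points covered by several balls are assigned). The names `bpercent` and `tpercent` follow the axis labels of the purity plot on the next slide; the function name `purity_figures` and the toy data are hypothetical.

```python
from collections import Counter

def purity_figures(clusters, labels):
    """Return (bpercent, tpercent) as fractions in [0, 1].

    clusters[i] is the cluster id of observation i and labels[i] is "B" or "T".
    bpercent: fraction of B points that fall in clusters containing no T points.
    tpercent: largest fraction of all T points found in any single cluster.
    """
    counts = {}
    for cluster_id, label in zip(clusters, labels):
        counts.setdefault(cluster_id, Counter())[label] += 1
    n_b = sum(1 for label in labels if label == "B")
    n_t = sum(1 for label in labels if label == "T")
    if n_b == 0 or n_t == 0:
        raise ValueError("both B and T observations are required")
    b_in_pure = sum(c["B"] for c in counts.values() if c["T"] == 0)
    t_max = max(c["T"] for c in counts.values())
    return b_in_pure / n_b, t_max / n_t

# Hypothetical assignment of seven observations to four clusters.
clusters = [0, 0, 1, 2, 2, 2, 3]
labels = ["B", "B", "B", "T", "T", "B", "T"]
bpercent, tpercent = purity_figures(clusters, labels)
print(bpercent, tpercent)
```

Here three of the four B points sit in pure B clusters and the best single cluster holds two of the three T points, so the pair is 0.75 and 2/3.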

  21. Purity (Latent Class Discovery) for the Golub Gene Data (red triangles are the greedy solutions)
      [Figure: tpercent (vertical axis, 0.80 to 1.00) against bpercent (horizontal axis, 0.4 to 0.9).]

  22. Remaining Questions
      - Demonstrated similar latent class discovery among all of the greedy dominating set solutions.
      - Many of the 7129 variates (genes) are superfluous to the discriminant analysis problem.
      - Work is ongoing to examine the discovered latent classes based on subsets of the genes.
      - Various figures of merit have been used to choose the subsets of the genes.

  23. Conclusions
      - Developed a new concept for latent class discovery during discriminant analysis.
      - Illustrated one graph theoretic methodology for the discovery of the latent classes.
      - Illustrated this methodology with a gene expression data set.
      - Presented some preliminary results examining the robustness of the discovery process to the CCCD process.

  24. Readings
      - C. E. Priebe, J. L. Solka, D. J. Marchette, and B. T. Clark, “Class Cover Catch Digraphs for Latent Class Discovery in Gene Expression Monitoring by DNA Microarrays,” to appear in the Special Issue of Computational Statistics and Data Analysis on Statistical Visualization, 2002+.
      - J. L. Solka, C. E. Priebe, and B. T. Clark, “A Visualization Framework for the Analysis of Hyperdimensional Data,” International Journal of Image and Graphics, Special Issue on Data Mining, 2002.
      - D. J. Marchette and C. E. Priebe, “Characterizing the scale dimension of a high-dimensional classification problem,” Pattern Recognition, 2002.
