Unusual Tensor Decompositions for Informatics Applications


  1. Unusual Tensor Decompositions for Informatics Applications. Brett W. Bader, Sandia National Laboratories. NSF Tensor Workshop, February 20, 2009. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Acknowledgements • Richard Harshman (Univ. Western Ontario) • Peter Chew (Sandia) • Tammy Kolda (Sandia) • Ahmed Abdelali (NMSU)

  3. Tensor Decompositions [Figure: a tensor and several of its decompositions: PARAFAC, PARAFAC2, 3-way DEDICOM, Tucker, ...and many more!] Each provides a different interpretation of the data

  4. Temporal Analysis of Enron email using 3-way DEDICOM

  5. Three-way DEDICOM
     $X_k \approx A D_k R D_k A^T, \quad k = 1, \dots, K$
     • Introduced by Harshman (1978)
     • DEcomposition into DIrectional COMponents
     • Columns of A are not necessarily orthogonal
     • Central matrix R contains asymmetric information from X
     • *Unique* solution with enough slices of X with sufficient variation
       - i.e., no rotation of A possible
       - greater confidence in interpretation of results
     • Alternating algorithms; least-squares and approximations
     • Early applications:
       - World trade (import/export matrices)
       - Car switching
     • Variations: constrained DEDICOM
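
The three-way DEDICOM model on this slide, $X_k \approx A D_k R D_k A^T$, can be illustrated with a small NumPy sketch. This only builds slices from given factors; it is not a fitting algorithm, and the sizes, the random factors, and the helper name `dedicom_slices` are assumptions made for illustration.

```python
import numpy as np

def dedicom_slices(A, R, D):
    """Build the model slices A @ D_k @ R @ D_k @ A.T.

    A: (n, p) loadings, R: (p, p) central matrix (may be asymmetric),
    D: (K, p) array whose k-th row holds the diagonal of D_k.
    """
    return np.stack([A @ np.diag(d) @ R @ np.diag(d) @ A.T for d in D])

# Hypothetical sizes: 6 objects, 2 components, 3 slices.
rng = np.random.default_rng(0)
n, p, K = 6, 2, 3
A = rng.random((n, p))                      # loadings, not orthogonal in general
R = np.array([[1.0, 0.8],
              [0.1, 1.0]])                  # asymmetric central matrix
D = rng.random((K, p)) + 0.5                # positive per-slice weights

X = dedicom_slices(A, R, D)                 # shape (K, n, n)
# The asymmetry of R carries over to each slice of X.
slice_is_symmetric = np.allclose(X[0], X[0].T)
```

With a symmetric `R` every slice would be symmetric, which is why the asymmetric information in X has to be absorbed by R.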

  6. Application: Enron Email Analysis [Figure: example network linking Alice, Bob, Carl, David, Ellen, Frank, Gary, Henk, and Ingrid] • Links consist of email communications • What can we learn about this network strictly from their communication patterns? (Social network analysis)

  7. Case Study: Enron
     • Enron created energy markets
     • EnronOnline: e-trading business
       - natural gas
       - electric power
     • Investigations
       - FERC
         • energy market manipulation
         • involved energy traders
       - SEC
         • accounting fraud
         • insider trading
     [Figure: Email communications at Enron (1998-2002); y-axis: Messages (0 to 3500); x-axis: Month]

  8. Temporal Social Network Analysis
     • Email communications at Enron (1998-2002) (data released by U.S. Federal Energy Regulatory Commission)
     • Emails among 184 employees over 44 months
     • Time series of communication graphs among employees (e.g., January, February, March, April) → adjacency array → DEDICOM
     • Joint work with R. Harshman (UWO) and T. Kolda
     [Figure: Email communications at Enron (1998-2002); y-axis: Messages (0 to 3500); x-axis: Month]
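
The adjacency array mentioned on this slide (one communication graph per month, stacked along a third mode) might be assembled roughly as follows. This is a minimal sketch: the record format `(sender, recipient, month)`, the helper name `build_adjacency_array`, and the toy records are assumptions, not the actual Enron preprocessing.

```python
import numpy as np

def build_adjacency_array(records, n_people, n_months):
    """Count messages into X[sender, recipient, month].

    records: iterable of (sender_index, recipient_index, month_index) tuples.
    Each frontal slice X[:, :, t] is the (generally asymmetric) adjacency
    matrix of the communication graph for month t.
    """
    X = np.zeros((n_people, n_people, n_months))
    for sender, recipient, month in records:
        X[sender, recipient, month] += 1
    return X

# Toy data (hypothetical): 3 people, 2 months, 4 messages.
records = [(0, 1, 0), (0, 1, 0), (1, 0, 0), (2, 0, 1)]
X = build_adjacency_array(records, n_people=3, n_months=2)
```

For the data on the slide the same construction would give a 184 x 184 x 44 array; a sparse format would be a natural choice at that size.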

  9. Roles of Employees
     • Role labels: Legal, Executive (gov't affairs), Executive (trade), Pipeline
     • Example row: L. Kitchen - President, Enron Online: 0.11, -0.09, 0.53, 0.00
     • Soft clustering
     • Identify shared characteristics to label group
     [Figure: bi-plots of two roles (Column 1 vs. Column 2; Column 3 vs. Column 4). Labeled employees include J. Dasovich (Employee, Government Relationship Executive), J. Steffes (VP, Government Affairs), R. Shapiro (VP, Regulatory Affairs), S. Kean (VP, Chief of Staff), R. Sanders (VP, Enron Wholesale Services), L. Donoho and M. McConnell (Employees, Transwestern Pipeline Company (ETS)), and L. Blair (Employee, Northern Natural Gas Pipeline (ETS))]

  10. Communication Patterns [Figure: graph of four roles (Gov't affairs, Executive, Legal, Pipeline) with edge weights 157.8, 211.6, 286.7, 93.5, 13.8, 13.4, 440.2, 172.4] • Mostly communication within roles • Some asymmetric exchanges

  11. Temporal Patterns [Figure: Communication patterns over time; y-axis: Normalized scale; x-axis: Month. Legend: Group 1 (Legal), Group 2 (Government & regulatory affairs), Group 3 (Trade executives), Group 4 (Pipeline employee). Annotations: "Enron crisis breaks; investigation begins" and "Filed for bankruptcy"]

  12. Multilingual Text Analysis using PARAFAC2

  13. PARAFAC2
      $X_k \approx A_k C_k B^T$
      • Introduced by Harshman (1972)
      • Less constrained than PARAFAC
      • Related to 3-way DEDICOM
      • Slices of A are constrained but not necessarily orthogonal
      • *Unique* solution with enough slices of X with sufficient variation
        - i.e., no rotation of A possible
        - greater confidence in interpretation of results
      • Alternating algorithms: least-squares and approximations
      • Early applications:
        - Sets of cross-product matrices
        - Chromatographic data with retention time shifts

  14. Cross-language Information Retrieval (CLIR) • Web documents could be in any language • Goal: Cluster documents by topic regardless of language [Figure: languages on the web: English, German, Japanese, French, Chinese Simplified, Spanish, Russian, Dutch, Korean, Polish, Portuguese, Chinese Traditional, Swedish, Czech, Norwegian, Italian, Danish, Hungarian, Finnish, Hebrew, Arabic, Turkish, Slovak, Indonesian, Bulgarian, Croatian, Catalan, Slovenian, Greek, Romanian, Serbian, Estonian, Icelandic, Lithuanian, Latvian] [Figure: documents in English, French, Arabic, and Spanish]

  15. Bible as Parallel Corpus
      Linguistic differences among translations:
      Translation                    Terms     Total Words
      English (King James)           12,335    789,744
      Spanish (Reina Valera 1909)    28,456    704,004
      Russian (Synodal 1876)         47,226    560,524
      Arabic (Smith Van Dyke)        55,300    440,435
      French (Darby)                 20,428    812,947
      • Languages convey information in different numbers of words
        - Isolating language: one morpheme per word
          • e.g., "He travelled by hovercraft on the sea." Largely isolating, but "travelled" and "hovercraft" each have two morphemes per word.
        - Synthetic language: high morpheme-per-word ratio
          • e.g., Aufsichtsratsmitgliederversammlung => "On-view-council-with-limbs-gathering", meaning "meeting of members of the supervisory board".

  16. Term-Doc Matrix • Term-by-verse matrix for all languages (163,745 x 31,230): rows are the English, Spanish, Russian, Arabic, and French terms; columns are Bible verses • Look for co-occurrence of terms in the same verses and across languages to capture latent concepts

  17. Latent Semantic Indexing
      • Term-by-verse matrix for all languages (rows: English, Spanish, Russian, Arabic, and French terms; columns: Bible verses)
      • Truncated SVD: $A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, where $U_k$ is term x concept
      • Projection: project new documents of interest into the subspace of $U \Sigma^{-1}$ and compute cosine similarities
      [Figure: truncated SVD of the term-by-verse matrix and an example document feature vector with 20 dimensions]
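
The truncated-SVD projection described on this slide can be sketched in NumPy as below. It is a toy illustration under stated assumptions: the 6 x 4 count matrix, the choice k = 2, and the helper names `lsi_fit`, `lsi_project`, and `cosine_similarities` are all hypothetical, and a dense SVD would not scale to the 163,745 x 31,230 matrix from the talk.

```python
import numpy as np

def lsi_fit(X, k):
    """Truncated SVD of a term-by-document matrix: X ~ U_k diag(s_k) Vt_k."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def lsi_project(U_k, s_k, docs):
    """Map term-space columns of `docs` to k-dim vectors: diag(1/s_k) U_k^T docs."""
    return (U_k.T @ docs) / s_k[:, None]

def cosine_similarities(query_vec, doc_vecs):
    """Cosine similarity between one k-vector and each column of doc_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=0, keepdims=True)
    return q @ d

# Toy term-by-document counts (hypothetical): 6 terms, 4 documents.
X = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2],
              [1, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

U_k, s_k, Vt_k = lsi_fit(X, k=2)
doc_vecs = lsi_project(U_k, s_k, X)          # training docs in concept space
sims = cosine_similarities(doc_vecs[:, 0], doc_vecs)
```

Projecting the training documents this way reproduces the columns of `Vt_k`, so new documents land in the same concept space as the verses used to build the SVD.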

  18. Quran as Test Set • Quran is translated into many languages, just like the Bible • 114 suras (or chapters) • More variation across translations = harder clustering task

  19. Performance Metrics
      • MP5: Average multilingual precision at 5 (or n) documents
        - The average percentage of the top 5 documents that are translations of the query document
        - Calculated as an average for all languages
        - Essentially, MP5 measures success in multilingual clustering
      [Figure: a query document in one language matched against documents in Lang 1 and Lang 2]
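
One way to read the MP5 definition above is sketched here in Python. Several details are assumptions rather than the authors' procedure: the query itself is excluded from the ranking, the average is taken over all query documents at once, and the function name `multilingual_precision_at_n` and the toy similarity matrices are hypothetical.

```python
import numpy as np

def multilingual_precision_at_n(sim, doc_ids, n=5):
    """Average fraction of the top-n retrieved documents that are translations.

    sim:     (m, m) similarity matrix between all documents.
    doc_ids: length-m labels; documents with the same label are translations
             of one another.
    Assumption: the query document itself is removed from the ranking.
    """
    doc_ids = np.asarray(doc_ids)
    precisions = []
    for q in range(sim.shape[0]):
        scores = sim[q].astype(float)        # astype returns a copy
        scores[q] = -np.inf                  # never retrieve the query itself
        top = np.argsort(-scores, kind="stable")[:n]
        precisions.append(np.mean(doc_ids[top] == doc_ids[q]))
    return float(np.mean(precisions))

# Toy example (hypothetical): 2 topics, each in 3 languages, evaluated at n=2.
doc_ids = np.array([0, 0, 0, 1, 1, 1])
same_topic = (doc_ids[:, None] == doc_ids[None, :]).astype(float)
mp_perfect = multilingual_precision_at_n(same_topic, doc_ids, n=2)
mp_worst = multilingual_precision_at_n(1.0 - same_topic, doc_ids, n=2)
```

When similarities track topic, every retrieved document is a translation of the query; when documents cluster by something else, the score drops toward zero.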

  20. LSA Results (5 languages, 240 latent dimensions)
      Method     Average MP5
      SVD/LSA    65.5%
      Documents tend to cluster more by language than by topic

  21. New Approach: Multi-matrix Array (Chew, Bader, Kolda, Abdelali, 2007) • Term-by-verse matrix for each language: X1 (English), X2 (Spanish), X3 (Russian), X4 (Arabic), X5 (French) • Array size: 55,300 x 31,230 x 5 with 2,765,719 nonzeros

  22. Tucker1 [Figure: Tucker and Tucker1 decompositions; diagram labels: X1, X2, X3, U1, U2, U3, S1, V^T]

  23. Tucker1 Results (5 languages, 240 latent dimensions)
      Method     Average MP5
      SVD/LSA    65.5%
      Tucker1    71.3%
      Only minor improvement because each $U_k$ is not orthogonal

  24. PARAFAC2 (Harshman, 1972) $X_k \approx U_k H S_k V^T$, where each $U_k$ is orthonormal and $S_k$ is diagonal
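
The structure of this PARAFAC2 form can be illustrated with a short NumPy sketch that builds slices $U_k H S_k V^T$ from given factors. It does not fit anything; the sizes, the random factors, and the helper name `parafac2_slices` are assumptions for illustration. The slices are given different row counts here because PARAFAC2 does not require them to match.

```python
import numpy as np

def parafac2_slices(U, H, S, V):
    """Build the model slices U_k @ H @ S_k @ V.T.

    U: list of (m_k, r) matrices with orthonormal columns,
    H: (r, r) matrix shared by all slices,
    S: list of (r, r) diagonal matrices, V: (n, r) shared factor.
    """
    return [U_k @ H @ S_k @ V.T for U_k, S_k in zip(U, S)]

rng = np.random.default_rng(0)
r, n_docs = 3, 8
row_counts = [5, 7, 6]                       # hypothetical per-slice term counts

H = rng.standard_normal((r, r))
V = rng.standard_normal((n_docs, r))
# Reduced QR of a random matrix yields orthonormal columns for each U_k.
U = [np.linalg.qr(rng.standard_normal((m, r)))[0] for m in row_counts]
S = [np.diag(rng.random(r) + 0.5) for _ in row_counts]

X = parafac2_slices(U, H, S, V)

# With orthonormal U_k, the cross-product (U_k H)^T (U_k H) equals H^T H
# for every slice, which is the constraint that distinguishes PARAFAC2.
cross_products = [(U_k @ H).T @ (U_k @ H) for U_k in U]
```

The constant cross-product is what allows the per-slice factors $U_k H$ to differ between languages while still sharing one set of latent concepts through $V$.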
