Association rules and compositional data analysis: implications to big data


  1. Association rules and compositional data analysis: implications to big data. R. S. Kenett (1), J. A. Martín-Fernández (2), S. Thió-Henestrosa (2) and M. Vives-Mestres (2). (1) KPA Group, Israel; University of Turin, Italy; and Neaman Institute, Technion, Israel. (2) Universitat de Girona, Spain.

  2. CoDaWork 2017 2

  3. This is work in progress. The long-term goal is to introduce CoDa to text (semantic data) analysis and to scale it to big data …

  4. Association Rules (AR). Basket analysis: each transaction (document, itemset) is a set of terms (items, tokens, words). A rule A => B has an antecedent, the LHS (A), and a consequent, the RHS (B).

  5. AR: Support, Confidence, Lift and Odds Ratio. For a rule A => B, the relative frequencies of the transactions form a 2x2 table:

                      B: RHS    B: ^RHS    total
          A: LHS       x_1        x_2        g
          A: ^LHS      x_3        x_4       1-g
          total         f         1-f        1

     with $\sum_{i=1}^{4} x_i = 1$, $0 \le x_i$, $i = 1, \dots, 4$.

     • Support, the proportion of transactions in which an item set appears: support{A=>B} = $x_1$.
     • Confidence, the strength of implication, or predictive power: confidence{A=>B} = $x_1 / g$.
     • Lift: lift{A=>B} = confidence{A=>B} / support{B} = support{A=>B} / (support{A} support{B}). Lift < 1: A and B repel each other. Lift > 1: A and B have affinity to each other.
     • Odds ratio: OR{A=>B} = $(x_1 x_4)/(x_2 x_3)$. OR < 1: A and B repel each other. OR > 1: A and B have affinity to each other.
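The four measures on slide 5 can be computed directly from the cells of the 2x2 table. The following is a minimal sketch, assuming the cell layout of that slide (x1: A and B, x2: A without B, x3: B without A, x4: neither); the class name `Table2x2`, the helper `from_counts` and the example counts are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table2x2:
    """Relative frequencies of the 2x2 table for a rule A => B (hypothetical helper)."""
    x1: float  # A and B
    x2: float  # A and not B
    x3: float  # not A and B
    x4: float  # neither A nor B

    @property
    def support(self) -> float:
        return self.x1

    @property
    def confidence(self) -> float:
        # x1 / g, where g = x1 + x2 is support{A}
        return self.x1 / (self.x1 + self.x2)

    @property
    def lift(self) -> float:
        # confidence / support{B}, where support{B} = f = x1 + x3
        return self.confidence / (self.x1 + self.x3)

    @property
    def odds_ratio(self) -> float:
        return (self.x1 * self.x4) / (self.x2 * self.x3)


def from_counts(n11: int, n12: int, n21: int, n22: int) -> Table2x2:
    """Turn raw transaction counts into relative frequencies that sum to 1."""
    n = n11 + n12 + n21 + n22
    return Table2x2(n11 / n, n12 / n, n21 / n, n22 / n)


# Made-up counts, for illustration only.
t = from_counts(40, 10, 20, 30)
print(t.support, t.confidence, t.lift, t.odds_ratio)
```

In this made-up table both lift and the odds ratio exceed 1, so the two measures agree that A and B have affinity to each other.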

  6. The Simplex. The 2x2 table of A (LHS, ^LHS) by B (RHS, ^RHS) has cells $(x_1, x_2, x_3, x_4)$ with $\sum_{i=1}^{4} x_i = 1$, $0 \le x_i$, $i = 1, \dots, 4$, so each table is a point of the simplex.

  7. Relative Linkage Disequilibrium (RLD). For the 2x2 table with rows LHS, ^LHS (margins g, 1-g) and columns RHS, ^RHS (margins f, 1-f):

     $$D = x_1 x_4 - x_2 x_3, \qquad \mathrm{RLD} = \frac{D}{D_M}$$

     $$X = f \otimes g + D\, e \otimes e \quad \text{(independence + dependence)}$$

     where $X = (x_1, x_2, x_3, x_4)$, $f = (f, 1-f)$, $g = (g, 1-g)$ and $e = (1, -1)$. $D = 0$ ⟺ OR{A=>B} = 1 ⟺ lift{A=>B} = 1.

     References: Kenett, R.S. (2014). Frequency vectors and contingency tables: a nonparametric and graphical analysis. Girona Seminar, 27/11/14. Kenett, R.S. (1983). On an Exploratory Analysis of Contingency Tables. J R Stat Soc Series D, 32, 395-403.

  8. CoDa Analysis and Principles
     • Scale invariance: vectors $P = [p_1, \dots, p_D]$ and $P' = \alpha P$, $\alpha > 0$, give the same information.
     • Subcompositional coherence.
     • Multiplicative tools applied to CoDa are equivalent to classical additive (Euclidean) tools applied to log-ratio values.
     • Transform CoDa, e.g. to isometric log-ratio coordinates: ilr(x).

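  9. Logratio (multiplicative) approach. Simplex: raw data (%) → real space: log-ratio coordinates (alr, clr, or ilr). For $x = (x_1, x_2, \dots, x_D)$:

     $$\mathrm{clr}(x) = \left(\log\frac{x_1}{g(x)},\ \log\frac{x_2}{g(x)},\ \dots,\ \log\frac{x_D}{g(x)}\right)$$

     $$\mathrm{ilr}(x) = \left(\mathrm{ilr}_1(x), \dots, \mathrm{ilr}_{D-1}(x)\right)$$

     where $g(x)$ is the geometric mean of the parts of $x$.

     [Figure: ternary diagram with vertices x1, x2, x3, mapped to the plane of ilr coordinates with axes ilr 1 and ilr 2.]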
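The clr transform and the scale-invariance principle can be illustrated in a few lines. This is a minimal sketch using NumPy, assuming strictly positive parts; the function names `closure` and `clr` and the example vector are hypothetical.

```python
import numpy as np


def closure(x):
    """Rescale a positive vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()


def clr(x):
    """Centered log-ratio: log of each part over the geometric mean of all parts."""
    logx = np.log(np.asarray(x, dtype=float))
    # Subtracting the mean of the logs is the same as dividing by the geometric mean.
    return logx - logx.mean()


raw = np.array([10.0, 30.0, 60.0])  # made-up raw data (%)
print(clr(raw))
print(clr(closure(raw)))            # identical: clr ignores the overall scale
```

Because clr only depends on ratios between parts, rescaling the input by any positive constant leaves the result unchanged, and the clr coordinates always sum to zero.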

  10. CoDa Analysis of 2x2 tables. Table T:

                      B: RHS    B: ^RHS    total
          A: LHS       x_1        x_2        g
          A: ^LHS      x_3        x_4       1-g
          total         f         1-f        1

     with $\sum_{i=1}^{4} x_i = 1$, $0 \le x_i$, $i = 1, \dots, 4$.

     $$\mathrm{ilr}(T) = \left(\frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3}\right)$$

     Sequential Binary Partition (SBP); Pawlowsky-Glahn and Buccianti, 2011, Chapter 2.
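The three ilr coordinates of slide 10 are simple functions of the four cells. A minimal sketch, assuming all cells are strictly positive (the logarithms are undefined otherwise); the function name `ilr_2x2` and the example tables are hypothetical.

```python
import math


def ilr_2x2(x1, x2, x3, x4):
    """ilr coordinates of a positive 2x2 table, following the formula on slide 10."""
    ilr1 = 0.5 * math.log((x1 * x4) / (x2 * x3))  # (x1, x4) against (x2, x3)
    ilr2 = math.sqrt(2) / 2 * math.log(x1 / x4)   # x1 against x4
    ilr3 = math.sqrt(2) / 2 * math.log(x2 / x3)   # x2 against x3
    return ilr1, ilr2, ilr3


# Made-up table with positive association: the first coordinate is positive.
dependent = ilr_2x2(0.4, 0.1, 0.2, 0.3)
# Table built as a product of margins (g = 0.5, f = 0.6): the first coordinate is zero.
independent = ilr_2x2(0.3, 0.2, 0.3, 0.2)
print(dependent)
print(independent)
```

The first coordinate is half the log odds ratio, so its sign carries the direction of the association between A and B.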

  11. CoDa Analysis of 2x2 tables: ilr-coordinates

          table                   ilr_1                            ilr_2                  ilr_3
          T                       (1/2) ln(x_1 x_4 / (x_2 x_3))    (√2/2) ln(x_1 / x_4)   (√2/2) ln(x_2 / x_3)
          T_ind (independence)    0                                (√2/2) ln(x_1 / x_4)   (√2/2) ln(x_2 / x_3)
          T_int (interaction)     (1/2) ln(x_1 x_4 / (x_2 x_3))    0                      0

     T_int is obtained by the perturbation operation that subtracts table T_ind from T.

     • ilr_1(T) < 0: negative effect between itemsets (A true, B less likely true).
     • ilr_1(T) = 0: independence.
     • ilr_1(T) > 0: positive effect (A true, B more likely true).

  12. CoDa Analysis of 2x2 tables. Independence and interaction: $X = f \otimes g + D\, e \otimes e$, and ilr(T) = ilr(T_ind) + ilr(T_int). Let $\|T\|_a = \|\mathrm{ilr}(T)\|$ be the Aitchison norm of a table T; then $\|T\|_a^2 = \|T_{ind}\|_a^2 + \|T_{int}\|_a^2$, that is, one has a decomposition of the Aitchison norm of table T.
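The decomposition on slide 12 can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it writes the ilr coordinates of slide 10 as an orthonormal basis (`BASIS`), builds T_ind and T_int by zeroing the appropriate coordinates, and verifies that perturbing T_ind by T_int recovers T. The names `BASIS`, `closure`, `ilr`, `ilr_inv` and `decompose`, and the example counts, are hypothetical.

```python
import numpy as np

S = np.sqrt(2) / 2
# Rows are the orthonormal contrasts behind ilr_1, ilr_2, ilr_3 of slide 10.
BASIS = np.array([
    [0.5, -0.5, -0.5, 0.5],  # ilr_1: (x1, x4) against (x2, x3)
    [S, 0.0, 0.0, -S],       # ilr_2: x1 against x4
    [0.0, S, -S, 0.0],       # ilr_3: x2 against x3
])


def closure(x):
    """Rescale a positive vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()


def ilr(x):
    """ilr coordinates of a positive table (x1, x2, x3, x4)."""
    return BASIS @ np.log(np.asarray(x, dtype=float))


def ilr_inv(z):
    """Table in the simplex whose ilr coordinates are z."""
    return closure(np.exp(BASIS.T @ np.asarray(z, dtype=float)))


def decompose(x):
    """Split a table into its independence part and its interaction part."""
    z = ilr(x)
    t_ind = ilr_inv([0.0, z[1], z[2]])  # keeps ilr_2 and ilr_3, sets ilr_1 to 0
    t_int = ilr_inv([z[0], 0.0, 0.0])   # keeps only ilr_1
    return t_ind, t_int


T = closure([40, 10, 20, 30])  # made-up counts
T_ind, T_int = decompose(T)

# Perturbation (cell-wise product followed by closure) recovers T.
print(np.allclose(closure(T_ind * T_int), T))
# Squared Aitchison norms add up.
print(np.sum(ilr(T) ** 2), np.sum(ilr(T_ind) ** 2) + np.sum(ilr(T_int) ** 2))
```

The norms add because the two parts occupy disjoint ilr coordinates, which makes them orthogonal in the Aitchison geometry.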

  14. CoDa Simplicial Deviance (SD). Independence and interaction: $X = f \otimes g + D\, e \otimes e$.

     $$\mathrm{SD}(T) = \|T_{int}\|_a^2 = \mathrm{ilr}_1^2(T) = \frac{1}{4}\ln^2\frac{x_1 x_4}{x_2 x_3}$$

     $$\frac{x_1 x_4}{x_2 x_3} = 1 \iff \ln\frac{x_1 x_4}{x_2 x_3} = 0 \iff \mathrm{ilr}_1(T) = 0 \iff \mathrm{SD} = 0$$

  15. CoDa Relative Simplicial Deviance (RSD)

     $$\mathrm{RSD}(T) = \frac{\mathrm{SD}(T)}{\|T\|_a^2} = \frac{\mathrm{ilr}_1^2(T)}{\|\mathrm{ilr}(T)\|^2}, \qquad \mathrm{RLD} = \frac{D}{D_M}$$

     RSD takes values in the interval [0,1].

     Computation of RLD:
     If D > 0 then
          if x_3 < x_2 then RLD = D / (D + x_3) else RLD = D / (D + x_2)
     else
          if x_1 < x_4 then RLD = D / (D - x_1) else RLD = D / (D - x_4)
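SD, RSD and RLD can be compared side by side on a single table. The sketch below assumes strictly positive cells and reads the RLD branches of slide 15 as "divide D by D plus the smaller of x2, x3 when D > 0, and by D minus the smaller of x1, x4 otherwise"; that reading, the zero-division guards, the function names and the example tables are assumptions of this sketch.

```python
import math


def simplicial_deviance(x1, x2, x3, x4):
    """SD(T) = ilr_1(T)^2 = (1/4) ln^2(x1 x4 / (x2 x3))."""
    return 0.25 * math.log((x1 * x4) / (x2 * x3)) ** 2


def relative_simplicial_deviance(x1, x2, x3, x4):
    """RSD(T) = SD(T) / squared Aitchison norm of T."""
    sd = simplicial_deviance(x1, x2, x3, x4)
    ilr2_sq = 0.5 * math.log(x1 / x4) ** 2
    ilr3_sq = 0.5 * math.log(x2 / x3) ** 2
    norm_sq = sd + ilr2_sq + ilr3_sq
    # Convention assumed here: the uniform table (norm 0) gets RSD = 0.
    return sd / norm_sq if norm_sq > 0 else 0.0


def rld(x1, x2, x3, x4):
    """Relative Linkage Disequilibrium, using min() in place of the nested if/else."""
    d = x1 * x4 - x2 * x3
    if d == 0:
        return 0.0
    if d > 0:
        return d / (d + min(x2, x3))
    return d / (d - min(x1, x4))


positive = (0.4, 0.1, 0.2, 0.3)  # made-up table with D > 0
negative = (0.1, 0.4, 0.3, 0.2)  # made-up table with D < 0
for table in (positive, negative):
    print(simplicial_deviance(*table), relative_simplicial_deviance(*table), rld(*table))
```

Both tables are mirror images of each other, so they receive the same SD, RSD and RLD values even though the sign of D differs.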

  16. Bootstrap Algorithm. Egozcue et al. (2015) introduce a bootstrap algorithm consisting of the following steps:
     i) Calculate T_ind, T_int, SD and RSD.
     ii) Simulate 10000 multinomial samples T^(k), assuming the independence hypothesis H_0: T = T_ind is true. For each table T^(k), calculate T^(k)_ind, T^(k)_int, SD^(k) and RSD^(k).
     iii) Compare the values of SD and RSD with the distributions of the 10000 values of SD^(k) and RSD^(k), respectively, to obtain the percentile p-value (left tail). Calculate the 0.05 significance critical points (5th percentile) in the left tail of each distribution.
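The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the implementation of Egozcue et al. (2015): T_ind is obtained by setting the first ilr coordinate to zero, 0.5 is added to every simulated cell to avoid logarithms of zero (a choice made for this sketch only), and the left-tail percentile is computed as the share of simulated values not exceeding the observed one. All names and the example counts are hypothetical.

```python
import numpy as np

S = np.sqrt(2) / 2
BASIS = np.array([
    [0.5, -0.5, -0.5, 0.5],  # ilr_1 (interaction)
    [S, 0.0, 0.0, -S],       # ilr_2
    [0.0, S, -S, 0.0],       # ilr_3
])


def sd_rsd(tables):
    """SD and RSD for one or more positive tables given as rows (x1, x2, x3, x4)."""
    tables = np.atleast_2d(np.asarray(tables, dtype=float))
    # Rows of BASIS sum to zero, so counts need not be rescaled first.
    z = np.log(tables) @ BASIS.T
    sd = z[:, 0] ** 2
    norm_sq = (z ** 2).sum(axis=1)
    rsd = np.divide(sd, norm_sq, out=np.zeros_like(sd), where=norm_sq > 0)
    return sd, rsd


def independent_part(table):
    """T_ind: same ilr_2 and ilr_3 as the table, with ilr_1 set to zero."""
    z = np.log(np.asarray(table, dtype=float)) @ BASIS.T
    z[0] = 0.0
    w = np.exp(z @ BASIS)
    return w / w.sum()


def bootstrap_independence(counts, n_sim=10_000, seed=0):
    """Steps i) to iii) for a table of positive counts."""
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    # i) observed statistics and independence table
    sd, rsd = (float(v[0]) for v in sd_rsd(counts))
    t_ind = independent_part(counts)
    # ii) multinomial samples under H0; +0.5 per cell is an assumption of this sketch
    sims = rng.multinomial(int(counts.sum()), t_ind, size=n_sim) + 0.5
    sd_k, rsd_k = sd_rsd(sims)
    # iii) left-tail percentiles and 0.05 critical points
    return {
        "SD": sd,
        "RSD": rsd,
        "p_left_SD": float(np.mean(sd_k <= sd)),
        "p_left_RSD": float(np.mean(rsd_k <= rsd)),
        "crit_SD": float(np.quantile(sd_k, 0.05)),
        "crit_RSD": float(np.quantile(rsd_k, 0.05)),
    }


result = bootstrap_independence([40, 10, 20, 30])  # made-up counts
print(result)
```

Fixing `seed` makes the simulated distribution reproducible; a different zero-replacement rule would change the simulated values slightly.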

  17. CoDa Measures for Association Rules

     $$\mathrm{lift}(AR) = \frac{x_1}{(x_1 + x_2)(x_1 + x_3)}$$

     $$D(AR) = x_1 x_4 - x_2 x_3$$

     $$\mathrm{lift}(AR) = 1 + \frac{D(AR)}{(x_1 + x_2)(x_1 + x_3)}$$
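The second expression for lift on slide 17 follows from the first once the constraint $x_1 + x_2 + x_3 + x_4 = 1$ is used. A short worked derivation, added here as a sketch and not taken from the slides:

```latex
\begin{aligned}
(x_1 + x_2)(x_1 + x_3)
  &= x_1 (x_1 + x_2 + x_3) + x_2 x_3 \\
  &= x_1 (1 - x_4) + x_2 x_3 \\
  &= x_1 - (x_1 x_4 - x_2 x_3) = x_1 - D(AR), \\[4pt]
\mathrm{lift}(AR)
  &= \frac{x_1}{(x_1 + x_2)(x_1 + x_3)}
   = \frac{(x_1 + x_2)(x_1 + x_3) + D(AR)}{(x_1 + x_2)(x_1 + x_3)}
   = 1 + \frac{D(AR)}{(x_1 + x_2)(x_1 + x_3)}.
\end{aligned}
```

In particular, lift(AR) = 1 exactly when D(AR) = 0.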

  18. CoDa Measures for Association Rules

     OR(AR) = odds(B | A) / odds(B | A^c) = $(x_1 x_4)/(x_2 x_3)$.

     OR(AR) = 1 ⟺ D(AR) = 0 ⟺ Lift(AR) = 1.

     $$\mathrm{OR}^*(AR) = \text{Yule's } Q(AR) = \frac{x_1 x_4 - x_2 x_3}{x_1 x_4 + x_2 x_3}$$

     OR is defined in [0, +∞); OR* is defined in [-1, 1].

  19. CoDa Measures for Association Rules

     $$C(AR) = \mathrm{ilr}_1(T)$$

     $$C^*(AR) = \tanh(C(AR)) = \mathrm{OR}^*(AR) = \text{Yule's } Q(AR)$$

     C is defined in (-∞, +∞); C* is defined in [-1, 1].
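The identity between tanh of the first ilr coordinate and Yule's Q can be checked numerically. A minimal sketch, assuming strictly positive cells; the function names `yules_q` and `c_star` and the example tables are hypothetical.

```python
import math


def yules_q(x1, x2, x3, x4):
    """Yule's Q: (x1 x4 - x2 x3) / (x1 x4 + x2 x3), a value in [-1, 1]."""
    return (x1 * x4 - x2 * x3) / (x1 * x4 + x2 * x3)


def c_star(x1, x2, x3, x4):
    """tanh of the first ilr coordinate, ilr_1 = (1/2) ln(x1 x4 / (x2 x3))."""
    return math.tanh(0.5 * math.log((x1 * x4) / (x2 * x3)))


# Made-up tables: positive association, negative association, independence.
tables = [(0.4, 0.1, 0.2, 0.3), (0.1, 0.4, 0.3, 0.2), (0.3, 0.2, 0.3, 0.2)]
for t in tables:
    print(yules_q(*t), c_star(*t))
```

The two columns agree because tanh(y) = (e^{2y} - 1)/(e^{2y} + 1), and with y = ilr_1(T) the term e^{2y} is the odds ratio.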

  20. CoDa Measures for Association Rules - C(AR)

  21. A Case Study

  22. https://treato.com

  23. https://treato.com/Nicardipine/?a=s

  26. Document Term Matrix (DTM)

  27. Association Rules Analysis

  28. Association Rules by consequent Vasoconstriction (lowest Lift)

  29. Top 10 Association Rules by Lift (in red)

  30. Top 10 Association Rules by Lift (in red)

  31. CoDa AR Visualization: ilr plot by consequent item. [Figure: horizontal axis ilr.1 (interaction), vertical axis frequency.]

  32. CoDa AR Visualization: ilr plot (interaction)

  33. CoDa AR Visualization: clr plot by consequent item (interaction)

  34. Conclusions
     • The compositional measures of independence, SD and RSD, are coherent with the simplicial geometry of the simplex, the sample space of contingency tables of AR.
     • The relation between CoDa-AR measures and other common measures facilitates the interpretation of negative and positive effects between itemsets.
     • The CoDa geometry provides visualization techniques for the measures when all the significant AR of a large database are analyzed.
     • The principles of coherence and scalability, which are fundamental to CoDa, are relevant to big data text analysis.
     • More research in this area is needed.

  35. Acknowledgements: https://www.kaggle.com/c/instacart-market-basket-analysis

  36. Thank you for your attention
