Association rules and compositional data analysis: implications to big data
R. S. Kenett (1), J. A. Martín-Fernández (2), S. Thió-Henestrosa (2) and M. Vives-Mestres (2)
(1) KPA Group, Israel; University of Turin, Italy and Neaman Institute, Technion, Israel
(2) Universitat de Girona, Spain
CoDaWork 2017 2
This is work in progress. The long-term goal is to introduce CoDa to text (semantic data) analysis and to scale it to big data …
Association Rules (AR)
Transaction (document, itemset): terms, items, tokens, words
Rule: Antecedent, LHS (A) => Consequent, RHS (B)
Basket Analysis
AR: Support, Confidence, Lift and Odds Ratio

               B (RHS)   not B (^RHS)   total
A (LHS)        x_1       x_2            g
not A (^LHS)   x_3       x_4            1-g
total          f         1-f            1

$\sum_{i=1}^{4} x_i = 1$, $x_i \ge 0$, $i = 1, \dots, 4$.

Support: proportion of transactions in which an item set appears.
$\mathrm{support}\{A \Rightarrow B\} = x_1$

Confidence: strength of implication, or predictive power.
$\mathrm{confidence}\{A \Rightarrow B\} = x_1 / g$

Lift:
$\mathrm{lift}\{A \Rightarrow B\} = \mathrm{confidence}\{A \Rightarrow B\} / \mathrm{support}\{B\} = \mathrm{support}\{A \Rightarrow B\} / (\mathrm{support}\{A\}\,\mathrm{support}\{B\})$
Lift < 1: A and B repel each other. Lift > 1: A and B have affinity to each other.

Odds ratio:
$\mathrm{OR}\{A \Rightarrow B\} = (x_1 x_4) / (x_2 x_3)$
OR < 1: A and B repel each other. OR > 1: A and B have affinity to each other.
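As a minimal sketch of the four measures on this slide, the Python function below takes the four cells of the 2x2 table (raw counts or proportions) and returns support, confidence, lift and odds ratio. The function name `ar_measures` and the example counts are illustrative assumptions, not part of the original slides, and every cell is assumed to be strictly positive.

```python
def ar_measures(x1, x2, x3, x4):
    """Support, confidence, lift and odds ratio of the rule A => B.

    x1..x4 are the cells of the 2x2 table:
    (A and B, A and not-B, not-A and B, not-A and not-B).
    Hypothetical helper for illustration; assumes every cell is > 0.
    """
    total = x1 + x2 + x3 + x4
    # Closure: rescale so that the four cells sum to 1.
    x1, x2, x3, x4 = (v / total for v in (x1, x2, x3, x4))
    g = x1 + x2               # support{A}, row margin
    f = x1 + x3               # support{B}, column margin
    support = x1
    confidence = x1 / g
    lift = confidence / f     # equals support / (g * f)
    odds_ratio = (x1 * x4) / (x2 * x3)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "odds_ratio": odds_ratio,
    }


# Made-up counts, for illustration only.
measures = ar_measures(30, 10, 20, 40)
```

For these made-up counts both lift and odds ratio are above 1, which the slide reads as affinity between A and B.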
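The Simplex
The four cells (x_1, x_2, x_3, x_4) of the 2x2 table of A (LHS) against B (RHS) satisfy $\sum_{i=1}^{4} x_i = 1$, $x_i \ge 0$, $i = 1, \dots, 4$, so each table is a point in the simplex.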
Relative Linkage Disequilibrium
With g = x_1 + x_2 and f = x_1 + x_3 the LHS and RHS margins of the 2x2 table:
$X = f \otimes g + D\, e \otimes e$
where the first term is the independence part and the second term is the dependence part, and
$X = (x_1, x_2, x_3, x_4)$, $f = (f, 1-f)$, $g = (g, 1-g)$, $e = (1, -1)$.
$D = x_1 x_4 - x_2 x_3$
$RLD = D / D_M$
When D = 0: OR{A=>B} = 1 and lift{A=>B} = 1.

Kenett, R.S. (2014). Frequency vectors and contingency tables: a nonparametric and graphical analysis. Girona Seminar, 27/11/14.
Kenett, R.S. (1983). On an Exploratory Analysis of Contingency Tables. J R Stat Soc Series D, 32, 395-403.
CoDa Analysis and Principles
• Scale invariance: vectors P = [p_1, …, p_D] and P' = αP, α > 0, give the same information.
• Subcompositional coherence.
• Multiplicative tools for CoDa are equivalent to classical additive (Euclidean) tools applied to log-ratio values.
• Transform CoDa, e.g. to isometric log-ratio coordinates: ilr(x).
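Logratio (multiplicative) approach
Simplex: raw data (%). Real space: log-ratio coordinates (alr, clr, or ilr).
$x = (x_1, x_2, \dots, x_D)$
$\mathrm{clr}(x) = \left(\log\frac{x_1}{g_m(x)}, \log\frac{x_2}{g_m(x)}, \dots, \log\frac{x_D}{g_m(x)}\right)$, where $g_m(x)$ is the geometric mean of the parts
$\mathrm{ilr}(x) = (\mathrm{ilr}_1(x), \dots, \mathrm{ilr}_{D-1}(x))$
[Figure: ternary diagram with parts x1, x2, x3 (simplex) next to the same data in the ilr1-ilr2 plane (real space)]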
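The clr transform and the scale-invariance principle can be illustrated with a few lines of Python. This is only a sketch: the helper names `closure` and `clr` and the example vectors are assumptions made here, and all parts are assumed to be strictly positive.

```python
import math


def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]


def clr(x):
    """Centred log-ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)   # log of the geometric mean
    return [value - mean_log for value in logs]


# Two made-up vectors that differ only by a positive scale factor.
p = [1.0, 2.0, 7.0]
p_scaled = [10.0, 20.0, 70.0]
clr_p = clr(p)
clr_p_scaled = clr(p_scaled)
```

Because only ratios between parts enter the transform, `clr_p` and `clr_p_scaled` agree up to rounding, and the clr coordinates sum to zero.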
CoDa Analysis of 2x2 tables
Table T:

               B (RHS)   not B (^RHS)   total
A (LHS)        x_1       x_2            g
not A (^LHS)   x_3       x_4            1-g
total          f         1-f            1

$\sum_{i=1}^{4} x_i = 1$, $x_i \ge 0$, $i = 1, \dots, 4$.

$\mathrm{ilr}(T) = \left(\frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3}\right)$

Sequential Binary Partition (SBP): Pawlowsky-Glahn and Buccianti, 2011, Chapter 2.
CoDa Analysis of 2x2 tables
ilr_1(T) < 0: negative effect between itemsets (A true, B less likely true)
ilr_1(T) = 0: independence
ilr_1(T) > 0: positive effect (A true, B more likely true)

ilr-coordinates (ilr_1, ilr_2, ilr_3):
T: $\left(\frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3}\right)$
T_ind (independence): $\left(0,\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3}\right)$
T_int (interaction): $\left(\frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ 0,\ 0\right)$

T_int is obtained by the perturbation operation that subtracts table T_ind from T: $T_{int} = T \ominus T_{ind}$.
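The split of a table into an independence part and an interaction part can be sketched in Python as follows. The closed form used for T_ind is an assumption made for this sketch: it is chosen so that its ilr coordinates equal (0, ilr_2(T), ilr_3(T)) as listed above, and it is not quoted from the slides. All helper names and the example counts are hypothetical, and cells are assumed strictly positive.

```python
import math


def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]


def ilr_2x2(t):
    """ilr coordinates of a 2x2 table t = (x1, x2, x3, x4), all cells > 0."""
    x1, x2, x3, x4 = t
    return (
        0.5 * math.log((x1 * x4) / (x2 * x3)),
        math.sqrt(2) / 2 * math.log(x1 / x4),
        math.sqrt(2) / 2 * math.log(x2 / x3),
    )


def independent_part(t):
    """T_ind: table with ilr_1 = 0 and the same ilr_2, ilr_3 as t (assumed form)."""
    x1, x2, x3, x4 = t
    a = math.sqrt(x1 * x4)
    b = math.sqrt(x2 * x3)
    return closure([x1 * b, x2 * a, x3 * a, x4 * b])


def interaction_part(t):
    """T_int = T minus T_ind in the simplex: cell-wise ratio followed by closure."""
    t_ind = independent_part(t)
    return closure([x / y for x, y in zip(t, t_ind)])


# Made-up counts, for illustration only.
table = closure([30, 10, 20, 40])
t_ind = independent_part(table)
t_int = interaction_part(table)
```

Under these assumptions `ilr_2x2(t_ind)` has a zero first coordinate, `ilr_2x2(t_int)` has zero second and third coordinates, and the two coordinate vectors add up to `ilr_2x2(table)`.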
CoDa Analysis of 2x2 tables
$X = f \otimes g + D\, e \otimes e$ (independence part + interaction part)
$\mathrm{ilr}(T) = \mathrm{ilr}(T_{ind}) + \mathrm{ilr}(T_{int})$.
Let $\|T\|_a = \|\mathrm{ilr}(T)\|$ be the Aitchison norm of a table T. Then $\|T\|_a^2 = \|T_{ind}\|_a^2 + \|T_{int}\|_a^2$, that is, one has a decomposition of the squared Aitchison norm of table T.
CoDa Simplicial Deviance (SD)
$X = f \otimes g + D\, e \otimes e$ (independence part + interaction part)
$SD(T) = \|T_{int}\|_a^2 = \frac{1}{4}\ln^2\frac{x_1 x_4}{x_2 x_3} = \mathrm{ilr}_1^2(T)$
$\frac{x_1 x_4}{x_2 x_3} = 1 \iff \ln\frac{x_1 x_4}{x_2 x_3} = 0 \iff \mathrm{ilr}_1(T) = 0 \iff SD = 0$
CoDa Relative Simplicial Deviance (RSD)
$RSD(T) = \frac{SD(T)}{\|T\|_a^2} = \frac{\mathrm{ilr}_1^2(T)}{\|\mathrm{ilr}(T)\|^2}$
RSD takes values in the interval [0, 1].
For comparison, $RLD^2 = D^2 / D_M^2$, with RLD computed as:
If D > 0 then
    if x_3 < x_2 then RLD = D / (D + x_3)
    else RLD = D / (D + x_2)
else
    if x_1 < x_4 then RLD = D / (D - x_1)
    else RLD = D / (D - x_4)
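A minimal Python sketch of SD, RSD and RLD for a single 2x2 table is given below. It assumes strictly positive cells; the function names are hypothetical, the RLD branches follow the rule as reconstructed on this slide, and returning 0 for RSD when the Aitchison norm is zero is a convention chosen here rather than something stated in the slides.

```python
import math


def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]


def ilr_2x2(t):
    """ilr coordinates of a 2x2 table t = (x1, x2, x3, x4), all cells > 0."""
    x1, x2, x3, x4 = t
    return (
        0.5 * math.log((x1 * x4) / (x2 * x3)),
        math.sqrt(2) / 2 * math.log(x1 / x4),
        math.sqrt(2) / 2 * math.log(x2 / x3),
    )


def sd(t):
    """Simplicial deviance: squared first ilr coordinate."""
    return ilr_2x2(t)[0] ** 2


def rsd(t):
    """Relative simplicial deviance: SD over the squared Aitchison norm."""
    coords = ilr_2x2(t)
    norm2 = sum(c * c for c in coords)
    # Convention assumed here: RSD = 0 when the norm is zero (all cells equal).
    return coords[0] ** 2 / norm2 if norm2 > 0 else 0.0


def rld(t):
    """Relative linkage disequilibrium, computed on proportions."""
    x1, x2, x3, x4 = closure(t)
    d = x1 * x4 - x2 * x3
    if d > 0:
        return d / (d + min(x2, x3))
    if d < 0:
        return d / (d - min(x1, x4))
    return 0.0


# Made-up proportions, for illustration only.
example = [0.3, 0.1, 0.2, 0.4]
example_sd, example_rsd, example_rld = sd(example), rsd(example), rld(example)
```

Using `min(...)` is just a compact way to write the two inner if/else branches of the RLD rule.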
Bootstrap Algorithm
Egozcue et al. (2015) introduce a bootstrap algorithm consisting of the following steps:
i) Calculate T_ind, T_int, SD and RSD.
ii) Simulate 10000 multinomial samples (T^(k)) assuming that the independence hypothesis H_0: T = T_ind is true. For each table T^(k), calculate T^(k)_ind, T^(k)_int, SD^(k) and RSD^(k).
iii) Compare the values of SD and RSD with the distributions of the 10000 values of SD^(k) and RSD^(k), respectively, to obtain the percentile p-value (left tail). Calculate the 0.05 significance critical points (5th percentile) in the left tail of each distribution.
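The three steps above can be sketched in Python with the standard library only. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the closed form for T_ind is chosen so that its ilr coordinates are (0, ilr_2(T), ilr_3(T)), a pseudo-count of 0.5 is added to every cell so that simulated zero counts do not break the logarithms, and SD and RSD are computed directly from the ilr coordinates instead of building T^(k)_ind and T^(k)_int explicitly. The reported value is the fraction of simulated statistics not exceeding the observed one (the left-tail reading used on the slide); its complement is one minus that fraction.

```python
import math
import random


def closure(x):
    total = sum(x)
    return [v / total for v in x]


def ilr_2x2(t):
    x1, x2, x3, x4 = t
    return (
        0.5 * math.log((x1 * x4) / (x2 * x3)),
        math.sqrt(2) / 2 * math.log(x1 / x4),
        math.sqrt(2) / 2 * math.log(x2 / x3),
    )


def independent_part(t):
    # Assumed closed form: ilr_1 = 0, ilr_2 and ilr_3 unchanged.
    x1, x2, x3, x4 = t
    a, b = math.sqrt(x1 * x4), math.sqrt(x2 * x3)
    return closure([x1 * b, x2 * a, x3 * a, x4 * b])


def sd_rsd(t):
    coords = ilr_2x2(t)
    norm2 = sum(c * c for c in coords)
    sd = coords[0] ** 2
    return sd, (sd / norm2 if norm2 > 0 else 0.0)


def bootstrap_percentiles(counts, n_boot=10000, pseudo=0.5, seed=0):
    rng = random.Random(seed)
    n = sum(counts)
    # Step i: observed statistics (pseudo-count is an assumption of this sketch).
    t = closure([c + pseudo for c in counts])
    sd_obs, rsd_obs = sd_rsd(t)
    p0 = independent_part(t)          # H0: T = T_ind
    sd_sim, rsd_sim = [], []
    # Step ii: multinomial samples of size n under H0.
    for _ in range(n_boot):
        draws = rng.choices(range(4), weights=p0, k=n)
        sim = [draws.count(i) + pseudo for i in range(4)]
        s, r = sd_rsd(closure(sim))
        sd_sim.append(s)
        rsd_sim.append(r)

    # Step iii: empirical left-tail fraction and 5th-percentile critical point.
    def left_tail(obs, sims):
        return sum(v <= obs for v in sims) / len(sims)

    def crit_05(sims):
        return sorted(sims)[int(0.05 * len(sims))]

    return {
        "sd": sd_obs, "rsd": rsd_obs,
        "sd_left_tail": left_tail(sd_obs, sd_sim),
        "rsd_left_tail": left_tail(rsd_obs, rsd_sim),
        "sd_crit_05": crit_05(sd_sim), "rsd_crit_05": crit_05(rsd_sim),
    }


# Made-up counts and a reduced number of replicates to keep the example fast.
result = bootstrap_percentiles([80, 20, 20, 80], n_boot=1000, seed=0)
```

The seed is fixed only to make the sketch reproducible; the default of 10000 replicates matches the number quoted on the slide.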
CoDa Measures for Association Rules
$\mathrm{lift}(AR) = \frac{x_1}{(x_1 + x_2)(x_1 + x_3)}$
$D(AR) = x_1 x_4 - x_2 x_3$
$\mathrm{lift}(AR) = 1 + \frac{D(AR)}{(x_1 + x_2)(x_1 + x_3)}$
CoDa Measures for Association Rules
$OR(AR) = \mathrm{odds}(B \mid A) / \mathrm{odds}(B \mid A^c) = (x_1 x_4) / (x_2 x_3)$.
$OR(AR) = 1 \iff D(AR) = 0 \iff \mathrm{Lift}(AR) = 1$
$OR^*(AR) = \text{Yule's } Q(AR) = \frac{x_1 x_4 - x_2 x_3}{x_1 x_4 + x_2 x_3}$
OR is defined in $[0, +\infty)$; OR* is defined in $[-1, 1]$.
CoDa Measures for Association Rules
$C(AR) = \mathrm{ilr}_1(T)$
$C^*(AR) = \tanh(C(AR)) = OR^*(AR) = \text{Yule's } Q(AR)$
C is defined in $(-\infty, +\infty)$; C* is defined in $[-1, 1]$.
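The relations between lift, D, OR, Yule's Q, C and C* stated on the measures slides can be checked numerically. The sketch below is only an illustration: the function name `coda_ar_measures` and the example counts are assumptions, and all cells are assumed strictly positive.

```python
import math


def coda_ar_measures(counts):
    """Classical and CoDa measures of the rule A => B from a 2x2 table."""
    total = sum(counts)
    x1, x2, x3, x4 = (c / total for c in counts)   # closure to proportions
    d = x1 * x4 - x2 * x3
    fg = (x1 + x2) * (x1 + x3)                     # support{A} * support{B}
    odds_ratio = (x1 * x4) / (x2 * x3)
    c = 0.5 * math.log(odds_ratio)                 # C(AR) = ilr_1(T)
    return {
        "d": d,
        "fg": fg,
        "lift": x1 / fg,
        "odds_ratio": odds_ratio,
        "yules_q": d / (x1 * x4 + x2 * x3),
        "c": c,
        "c_star": math.tanh(c),
    }


# Made-up counts, for illustration only.
m = coda_ar_measures([30, 10, 20, 40])
lift_from_d = 1 + m["d"] / m["fg"]
```

With these counts `m["c_star"]` and `m["yules_q"]` coincide up to rounding, and `lift_from_d` reproduces `m["lift"]`, in line with the identities on the slides.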
CoDa Measures for Association Rules - C(AR)
A Case Study
https://treato.com
https://treato.com/Nicardipine/?a=s
Document Term Matrix (DTM)
Association Rules Analysis
Association Rules by consequent Vasoconstriction (lowest Lift)
Top 10 Association Rules by Lift (in red)
CoDa AR Visualization: ilr plot by consequent item
[Figure: frequency histogram of ilr.1 (interaction)]
CoDa AR Visualization: ilr plot
[Figure: ilr plot (interaction)]
CoDa AR Visualization: clr plot by consequent item
[Figure: clr plot (interaction)]
Conclusions
• Compositional measures of independence SD and RSD are coherent with the simplicial geometry of the simplex, the sample space of contingency tables of AR.
• The relation between CoDa-AR measures and other common measures facilitates the interpretation of negative and positive effects between itemsets.
• The CoDa geometry provides visualization techniques of measures when all the significant AR of a large database are analyzed.
• The principles of coherence and scalability, which are fundamental to CoDa, are relevant to big data text analysis.
• More research in this area is needed.
Acknowledgements
https://www.kaggle.com/c/instacart-market-basket-analysis
Thank you for your attention