Intuitive Parameterization of Distance-Based Clustering Techniques
Altobelli de Brito Mantuan (amantuan@ic.uff.br)
Leandro A. F. Fernandes (laffernandes@ic.uff.br)
Many Faces of Distances - Campinas, Brazil - 2014
Conventional Pipeline
Input incidence matrix → Apriori → mined rules
Issue: performance and scalability

Input Incidence Matrix
      A    B    C    D    E    F
T1    0    0    1    1    0    1
T2    1    1    1    0    1    0
...   ...  ...  ...  ...  ...  ...
TN    0    1    0    0    1    1

Mined Rules
{A, C, F} → {B}
{A, D} → {F}

The typical example
Transaction ID   Milk   Bread   Butter   Beer
T1               1      1       0        0
T2               0      0       1        0
T3               0      0       0        1
T4               1      1       1        0
T5               0      1       0        0

Association Rule: {Butter, Bread} → {Milk}
Support = 20% (only T4 contains Butter, Bread, and Milk)
Confidence = 100% (T4 is the only transaction with Butter and Bread, and it also contains Milk)
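The support and confidence of a rule can be checked directly against the toy table. The sketch below is only an illustration: the helper names `support` and `confidence` are hypothetical, and the transactions are the five rows of the typical example encoded as sets.

```python
# Toy transactions from the typical example (Milk, Bread, Butter, Beer).
transactions = {
    "T1": {"Milk", "Bread"},
    "T2": {"Butter"},
    "T3": {"Beer"},
    "T4": {"Milk", "Bread", "Butter"},
    "T5": {"Bread"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(antecedent, consequent, db):
    """support(antecedent ∪ consequent) / support(antecedent).

    Raises ZeroDivisionError when the antecedent never occurs.
    """
    joint = support(set(antecedent) | set(consequent), db)
    return joint / support(antecedent, db)

rule_support = support({"Butter", "Bread", "Milk"}, transactions)
rule_confidence = confidence({"Butter", "Bread"}, {"Milk"}, transactions)
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Apriori has to evaluate such counts for a combinatorially growing set of candidate itemsets, which is where the performance and scalability issues come from.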
Proposed Approach
Input incidence matrix → dual scaling → response-style space → clustering & pruning → Apriori → mined rules
Issue: Euclidean distance is not intuitive for parameterizing clustering techniques

Input Incidence Matrix
      A    B    C    D    E    F
T1    0    0    1    1    0    1
T2    1    1    1    0    1    0
...   ...  ...  ...  ...  ...  ...
TN    0    1    0    0    1    1

Mined Rules
{A, C, F} → {B}
{A, D} → {F}
Defining the Response Style Space
• Dual scaling [Nishisato 1993]
  ▪ Versatile method typically applied in marketing research
  ▪ Analysis of preferences of human subjects
  ▪ Graphical representation of
    o Response-style patterns among surveyed transactions
    o Preferences over a set of items

Nishisato, S. “On quantifying different types of categorical data”. In: Psychometrika, 58(4), pp. 617-629, 1993.
Points in Response Style Space
A space where transactions and items are represented as points.
Transaction T is more related to (i.e., prefers) item A than to item E.
[Figure (fictitious data): mapped transactions (T1, T2, ..., T27) and mapped items (A, E) plotted as points in response-style space.]
Emerging Contexts
A context emerges from the existence of groups of items having similar preferences.
Each context is associated with a set of transactions with similar preferences.
Elements in the same context are likely to be part of significant itemsets.
[Figure (fictitious data): mapped transactions and mapped items grouped into contexts in response-style space.]
Dendrogram of Items in Response Style Space
[Figure: dendrogram of items. Axes: items; Euclidean distance in response-style space.]
Using Uncertainty Propagation
• We treat each item as an independent Bernoulli variable with parameter $q_j$
• Uncertainty is propagated from input data to the response-style space
• Advantages
  ▪ Easy to compute
  ▪ Domain-independent interpretation
  ▪ Intuitive parameterization
• We are investigating two approaches:
  ▪ Sampling-based approach
  ▪ First-order error propagation based approach
Sampling-Based Approach
• Dual scaling maximizes the squared correlation ratio ($\theta_j^2$) of each column of the input matrix $G$:

  $$\theta_j^2 = \frac{y_j^{T} G^{T} E_s^{-1} G \, y_j}{y_j^{T} E_d \, y_j}$$

• We are interested in the coefficients of $y_j$, i.e., the location of the items in response-style space
• A sample $\hat{t}_j^{\,l}$ is produced using $\hat{t}_j^{\,l} = Y G^{-1} t_j^{\,l}$, for $Y = (y_1, y_2, \cdots, y_n)$
• The samples of a given item define a symmetric distribution in response-style space

Input Incidence Matrix
      A    B    C    D    E    F
T1    0    0    1    1    0    1
T2    1    1    1    0    1    0
...   ...  ...  ...  ...  ...  ...
TN    0    1    0    0    1    1
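A minimal sketch of how the squared correlation ratio could be maximized numerically. It rests on assumptions that the slides do not state: $E_s$ and $E_d$ are taken to be the diagonal matrices of row and column totals of $G$ (the usual dual scaling formulation), every row and column of $G$ has at least one nonzero entry, and the function names are hypothetical. With the substitution $z = E_d^{1/2} y$, the ratio becomes a symmetric eigenvalue problem.

```python
import numpy as np

def correlation_ratio(G, y):
    """theta^2 = (y^T G^T E_s^-1 G y) / (y^T E_d y) for one weight vector y."""
    G = np.asarray(G, dtype=float)
    E_s_inv = np.diag(1.0 / G.sum(axis=1))  # assumption: inverse row totals
    E_d = np.diag(G.sum(axis=0))            # assumption: column totals
    return float(y @ G.T @ E_s_inv @ G @ y) / float(y @ E_d @ y)

def dual_scaling_items(G):
    """Return (theta2, Y): ratios in descending order, item weights as columns."""
    G = np.asarray(G, dtype=float)
    row_tot, col_tot = G.sum(axis=1), G.sum(axis=0)
    S = G / np.sqrt(col_tot)               # G E_d^(-1/2), scales each column
    M = S.T @ (S / row_tot[:, None])       # E_d^(-1/2) G^T E_s^-1 G E_d^(-1/2)
    vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]
    Y = vecs[:, order] / np.sqrt(col_tot)[:, None]  # back to y = E_d^(-1/2) z
    # The first solution is the trivial one (constant y, theta^2 = 1); drop it.
    return vals[order][1:], Y[:, 1:]

# Hypothetical example: only the three rows visible in the slide's matrix.
G = np.array([[0, 0, 1, 1, 0, 1],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 0, 1, 1]], dtype=float)
theta2, Y = dual_scaling_items(G)
coords = Y[:, :2]  # one row per item (A..F): a 2-D location for that item
```

Each row of `coords` places one item in a two-dimensional response-style space. Resampling the columns of `G` as Bernoulli variables and recomputing the coordinates would then give a cloud of points per item.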
Sampling-Based Approach
The samples characterize symmetric distributions around the original item in response-style space.
[Figure (fictitious data): mapped samples and mapped items in response-style space. Item labels: Daylight Sav.: Yes/No; Oil Loss: Yes/No; Gas Loss: Yes/No; Shift: Morn./After./Night; Company: 1/2/3/4.]
Dendrogram of Overlapping Distributions
[Figure: dendrograms of items. Labels: Uncertainty; Items; Complementary Bhattacharyya distance between distributions.]
Sampling is computationally expensive.
Defining the proper number of samples is difficult.
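To illustrate how the overlap between two sampled distributions could be measured, the sketch below fits a Gaussian to each sample cloud and evaluates the closed-form Bhattacharyya distance between Gaussians. The Gaussian fit, the function names, and the synthetic samples are assumptions made for illustration. The slides do not define "complementary"; reading it as one minus the Bhattacharyya coefficient is only a guess.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)                  # averaged covariance
    diff = mu1 - mu2
    mahal = diff @ np.linalg.solve(cov, diff)  # (mu1-mu2)^T cov^-1 (mu1-mu2)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.125 * mahal + 0.5 * (logdet - 0.5 * (logdet1 + logdet2))

def fit_gaussian(samples):
    """Sample mean and covariance of an (n_samples, n_dims) array."""
    samples = np.asarray(samples, float)
    return samples.mean(axis=0), np.cov(samples, rowvar=False)

# Synthetic stand-ins for the mapped samples of two items.
rng = np.random.default_rng(0)
samples_a = rng.normal([0.0, 0.0], 0.3, size=(500, 2))
samples_b = rng.normal([1.0, 0.5], 0.3, size=(500, 2))

d_ab = bhattacharyya_gaussian(*fit_gaussian(samples_a), *fit_gaussian(samples_b))
overlap = np.exp(-d_ab)        # Bhattacharyya coefficient, in (0, 1]
complementary = 1.0 - overlap  # assumption: one reading of "complementary"
```

A threshold on an overlap value bounded between 0 and 1 reads the same way in any domain, unlike a raw Euclidean distance.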
First-Order Error Propagation
Function that maps items to response-style space:
  $y = g(c_1, c_2, \cdots, c_l)$
Compute the resulting expectation of the distribution:
  $\bar{y} = g(\bar{c}_1, \bar{c}_2, \cdots, \bar{c}_l)$
Compute the resulting covariance matrix of the distribution:
  $\Sigma_Y \approx J_Y \, \Sigma_C \, J_Y^{T}$
($J_Y$: Jacobian matrix of $g$; $\Sigma$: covariance matrix)
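The propagation rule can be sketched with a numerical Jacobian. Everything concrete here is an assumption for illustration: the toy mapping `g` stands in for the real item-to-space function, the Jacobian is approximated by central differences rather than derived analytically, and the inputs are modeled as independent Bernoulli variables, so $\Sigma_C$ is diagonal with entries $q_j (1 - q_j)$.

```python
import numpy as np

def numerical_jacobian(g, c, eps=1e-6):
    """Central-difference approximation of the Jacobian of g at c."""
    c = np.asarray(c, float)
    y0 = np.asarray(g(c), float)
    J = np.empty((y0.size, c.size))
    for k in range(c.size):
        step = np.zeros_like(c)
        step[k] = eps
        J[:, k] = (np.asarray(g(c + step)) - np.asarray(g(c - step))) / (2 * eps)
    return J

def propagate(g, c_mean, cov_c):
    """First-order propagation: y_mean = g(c_mean), cov_y ~ J cov_c J^T."""
    c_mean = np.asarray(c_mean, float)
    y_mean = np.asarray(g(c_mean), float)
    J = numerical_jacobian(g, c_mean)
    return y_mean, J @ np.asarray(cov_c, float) @ J.T

def g(c):
    """Hypothetical toy mapping, NOT the dual scaling function."""
    return np.array([c[0] + 2.0 * c[1], c[1] * c[2]])

q = np.array([0.2, 0.5, 0.9])   # Bernoulli parameters of three inputs
cov_c = np.diag(q * (1.0 - q))  # independent Bernoulli variances
y_mean, cov_y = propagate(g, q, cov_c)
```

Unlike the sampling-based approach, this needs one evaluation of the mapping plus its Jacobian, and there is no number of samples to choose.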
Final Remarks
• Ongoing work
• Contributions
  ▪ A divide-and-conquer approach to alleviate the combinatorial issue of association rule learning
  ▪ The use of uncertainty propagation to develop an intuitive parameterization for distance-based clustering
    o Easy to compute, domain-independent interpretation, intuitive
• Synthetic and real databases
  ▪ ~1000 items and ~3000 transactions