
Intuitive Parameterization of Distance-Based Clustering Techniques



  1. Intuitive Parameterization of Distance-Based Clustering Techniques
     Altobelli de Brito Mantuan (amantuan@ic.uff.br)
     Leandro A. F. Fernandes (laffernandes@ic.uff.br)
     Many Faces of Distances - Campinas, Brazil - 2014

  2. Conventional Pipeline
     Input incidence matrix → Apriori → Mined rules

     Input incidence matrix:
            A    B    C    D    E    F
     T1     0    0    1    1    0    1
     T2     1    1    1    0    1    0
     ...    ...  ...  ...  ...  ...  ...
     TN     0    1    0    0    1    1

     Mined rules: {A, C, F} → {B}, {A, D} → {F}

     The typical example:
     Transaction ID   Milk   Bread   Butter   Beer
     T1               1      1       0        0
     T2               0      0       1        0
     T3               0      0       0        1
     T4               1      1       1        0
     T5               0      1       0        0

     Association rule: {Butter, Bread} → {Milk}
     Support = 20%, Confidence = 100%

     Performance and scalability issues
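The support and confidence of a rule can be checked directly against the five-transaction table. The following is a minimal Python sketch; the helper names `support` and `confidence` are illustrative and do not come from the slides.

```python
# Transactions from the "typical example" table (Milk, Bread, Butter, Beer).
transactions = {
    "T1": {"Milk", "Bread"},
    "T2": {"Butter"},
    "T3": {"Beer"},
    "T4": {"Milk", "Bread", "Butter"},
    "T5": {"Bread"},
}


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent plus consequent) divided by support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"Butter", "Bread", "Milk"})
rule_confidence = confidence({"Butter", "Bread"}, {"Milk"})
other_confidence = confidence({"Milk", "Bread"}, {"Butter"})
```

On this table, {Butter, Bread} → {Milk} has support 0.2 and confidence 1.0, because T4 is the only transaction containing both Butter and Bread and it also contains Milk. For comparison, {Milk, Bread} → {Butter} has confidence 0.5.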

  3. Proposed Approach
     Input incidence matrix → Dual Scaling → Response-Style Space → Clustering & Pruning → Apriori → Mined rules

     Input incidence matrix:
            A    B    C    D    E    F
     T1     0    0    1    1    0    1
     T2     1    1    1    0    1    0
     ...    ...  ...  ...  ...  ...  ...
     TN     0    1    0    0    1    1

     Mined rules: {A, C, F} → {B}, {A, D} → {F}

     Euclidean distance is not an intuitive way to parameterize clustering techniques.

  4. Defining the Response Style Space
     • Dual scaling [Nishisato 1993]
       ▪ Versatile method typically applied in marketing research
       ▪ Analysis of preferences of human subjects
       ▪ Graphical representation of
         o Response-style patterns among the surveyed transactions
         o The preferences over a set of items
     Nishisato, S. "On quantifying different types of categorical data". In: Psychometrika, 58(4), pp. 617-629, 1993.

  5. Points in Response Style Space (** fictitious data)
     A space where transactions and items are represented as points.
     Transaction T is more related to (i.e., prefers) item A than to item E.
     [Figure: scatter plot of mapped transactions (point labels such as T1, T12, T27, and the highlighted transaction T) and mapped items (A, E) in response-style space]

  6. Emerging Contexts (** fictitious data)
     A context emerges from the existence of groups of items having similar preferences.
     A set of transactions with similar preferences.
     Elements in the same context are likely to be part of significant itemsets.
     [Figure: mapped transactions and mapped items grouped into contexts in response-style space]

  7. Dendrogram of Items in Response Style Space
     [Figure: dendrogram; axes: items, Euclidean distance in response-style space]
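To make the parameterization issue concrete, here is a minimal sketch of cutting a single-linkage hierarchy at a Euclidean distance threshold. The points and thresholds are made up for illustration, and single linkage is an assumption, since the slides do not name a linkage criterion.

```python
import math
from itertools import combinations


def single_linkage_clusters(points, threshold):
    """Cut a single-linkage hierarchy at `threshold`.

    Two points end up in the same cluster when a chain of points, each
    at most `threshold` apart from the next, connects them.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical item coordinates in a 2-D embedding.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
tight = single_linkage_clusters(points, threshold=1.0)
loose = single_linkage_clusters(points, threshold=10.0)
```

The grouping depends entirely on a threshold expressed in the units of the embedding, which has no direct meaning in terms of the original transactions.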

  8. Using Uncertainty Propagation
     • We treat each item as an independent Bernoulli variable with parameter $p_i$
     • Uncertainty is propagated from input data to the response-style space
     • Advantages
       ▪ Easy to compute
       ▪ Domain-independent interpretation
       ▪ Intuitive parameterization
     • We are investigating two approaches:
       ▪ Sampling-based approach
       ▪ Approach based on first-order error propagation
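The Bernoulli treatment of items can be sketched as follows. This is an illustration only: estimating each parameter as the item's relative frequency is an assumption (the slides do not say how it is obtained), and the function names are hypothetical.

```python
import random

# Incidence matrix of the five-transaction example (rows: T1..T5).
items = ["Milk", "Bread", "Butter", "Beer"]
incidence = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]


def bernoulli_parameters(matrix):
    """Estimate each item's Bernoulli parameter as its column mean (assumed)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]


def bernoulli_variances(params):
    """Variance of a Bernoulli(p) variable is p * (1 - p)."""
    return [p * (1.0 - p) for p in params]


def sample_column(p, n_rows, rng):
    """Draw one synthetic 0/1 column with success probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n_rows)]


params = bernoulli_parameters(incidence)
variances = bernoulli_variances(params)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
milk_sample = sample_column(params[0], len(incidence), rng)
```

Because the items are treated as independent, the input covariance matrix is diagonal, with the values in `variances` on its diagonal.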

  9. Sampling-Based Approach
     • Dual scaling maximizes the squared correlation ratio ($\eta_i^2$) of each column of the input matrix $F$:
       $$\eta_i^2 = \frac{x_i^T F^T D_r^{-1} F x_i}{x_i^T D_c x_i}$$
     • We are interested in the coefficients of $x_i$, i.e., the location of the items in response-style space
     • A sample $\hat{s}_i^k$ is produced using
       $$\hat{s}_i^k = X F^{-1} s_i^k, \quad \text{for } X = [x_1, x_2, \cdots, x_m]$$
     • The samples of a given item define a symmetric distribution in response-style space

     Input incidence matrix $F$:
            A    B    C    D    E    F
     T1     0    0    1    1    0    1
     T2     1    1    1    0    1    0
     ...    ...  ...  ...  ...  ...  ...
     TN     0    1    0    0    1    1
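For reference, maximizing a correlation ratio of this form is a generalized Rayleigh-quotient problem. Assuming $F$ is the incidence matrix and $D_r$, $D_c$ are the diagonal matrices of its row and column totals (the usual dual-scaling convention, which the slide does not spell out), setting the gradient with respect to $x$ to zero gives a generalized eigenproblem:

```latex
\eta^2 = \frac{x^T F^T D_r^{-1} F\, x}{x^T D_c\, x}
\qquad\Longrightarrow\qquad
\frac{\partial \eta^2}{\partial x} = 0
\;\iff\;
F^T D_r^{-1} F\, x = \eta^2\, D_c\, x
```

Under these assumptions each solution vector is a generalized eigenvector, and its squared correlation ratio is the corresponding eigenvalue.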

  10. Sampling-Based Approach (** fictitious data)
      The samples characterize symmetric distributions around the original item in response-style space.
      [Figure: mapped samples and mapped items in response-style space. Item labels: Daylight Sav.: Yes, No; Oil Loss: Yes, No; Gas Loss: Yes, No; Shift: Morn., After., Night; Company: 1, 2, 3, 4]

  11. Dendrogram of Overlapping Distributions
      [Figure: dendrograms of items; axis labels: Items, Uncertainty, Complementary Bhattacharyya distance between distributions]
      Sampling is computationally expensive. Defining the proper number of samples is difficult.
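As an illustration of measuring overlap between distributions, the sketch below computes the Bhattacharyya distance between two 2-D Gaussians. Modelling each item's samples as a Gaussian, and reading "complementary" as one minus the Bhattacharyya coefficient, are both assumptions; the slides do not define either.

```python
import math


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two 2-D Gaussian distributions.

    mu1, mu2: (x, y) means; cov1, cov2: 2x2 covariance matrices (nested lists).
    """

    def det(m):
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]

    # Average covariance S = (cov1 + cov2) / 2.
    s = [[(cov1[r][c] + cov2[r][c]) / 2.0 for c in range(2)] for r in range(2)]
    det_s = det(s)
    # Closed-form inverse of the 2x2 matrix S.
    inv_s = [
        [s[1][1] / det_s, -s[0][1] / det_s],
        [-s[1][0] / det_s, s[0][0] / det_s],
    ]
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # Quadratic form d^T S^{-1} d for the mean difference d.
    quad = (dx * (inv_s[0][0] * dx + inv_s[0][1] * dy)
            + dy * (inv_s[1][0] * dx + inv_s[1][1] * dy))
    log_term = math.log(det_s / math.sqrt(det(cov1) * det(cov2)))
    return quad / 8.0 + 0.5 * log_term


def complementary_bhattacharyya(mu1, cov1, mu2, cov2):
    """1 - Bhattacharyya coefficient (assumed meaning of 'complementary')."""
    return 1.0 - math.exp(-bhattacharyya_distance(mu1, cov1, mu2, cov2))


identity = [[1.0, 0.0], [0.0, 1.0]]
d_same = bhattacharyya_distance((0.0, 0.0), identity, (0.0, 0.0), identity)
d_apart = bhattacharyya_distance((0.0, 0.0), identity, (2.0, 0.0), identity)
c_apart = complementary_bhattacharyya((0.0, 0.0), identity, (2.0, 0.0), identity)
```

Under this reading the value lies between 0 (identical distributions) and 1 (no overlap), so a clustering threshold can be stated without reference to the scale of the embedding.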

  12. First-Order Error Propagation
      $x = f(b_1, b_2, \cdots, b_k)$: function that maps items to response-style space
      $\bar{x} = f(\bar{b}_1, \bar{b}_2, \cdots, \bar{b}_k)$: compute the resulting expectation of the distribution
      $\Sigma_X \approx J_X \Sigma_B J_X^T$: compute the resulting covariance matrix of the distribution
      ($J_X$: Jacobian matrix of $f$, $\Sigma$: covariance matrix)
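First-order propagation of a covariance matrix through a mapping can be sketched in a few lines. The mapping `toy_map` below is a hypothetical stand-in chosen so the result is easy to verify by hand; it is not the dual-scaling mapping from the slides, and the finite-difference Jacobian is likewise an illustrative choice.

```python
def numerical_jacobian(func, point, h=1e-6):
    """Central-difference Jacobian of func at point (rows: outputs, cols: inputs)."""
    n_out = len(func(point))
    jac = [[0.0] * len(point) for _ in range(n_out)]
    for j in range(len(point)):
        up = list(point)
        down = list(point)
        up[j] += h
        down[j] -= h
        f_up, f_down = func(up), func(down)
        for i in range(n_out):
            jac[i][j] = (f_up[i] - f_down[i]) / (2.0 * h)
    return jac


def propagate_covariance(jac, cov_in):
    """First-order propagation: cov_out is approximately J * cov_in * J^T."""
    n_out, n_in = len(jac), len(cov_in)
    j_cov = [[sum(jac[i][k] * cov_in[k][j] for k in range(n_in))
              for j in range(n_in)] for i in range(n_out)]
    return [[sum(j_cov[i][k] * jac[j][k] for k in range(n_in))
             for j in range(n_out)] for i in range(n_out)]


def toy_map(b):
    """Hypothetical stand-in for the item-to-space mapping (NOT dual scaling)."""
    return [b[0] * b[1], b[0] + b[1]]


mean_in = [0.4, 0.6]            # expectations of two Bernoulli variables
cov_in = [[0.4 * 0.6, 0.0],     # independent variables: diagonal covariance
          [0.0, 0.6 * 0.4]]     # with entries p * (1 - p)
mean_out = toy_map(mean_in)     # first-order estimate of the output mean
cov_out = propagate_covariance(numerical_jacobian(toy_map, mean_in), cov_in)
```

Unlike sampling, this needs one evaluation of the mapping and its Jacobian per item, and no choice of sample count.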

  13. Final Remarks
      • Ongoing work
      • Contributions
        ▪ A divide-and-conquer approach to alleviate the combinatorial issue of association rule learning
        ▪ The use of uncertainty propagation to develop an intuitive parameterization for distance-based clustering
          o Easy to compute, domain-independent interpretation, intuitive
      • Synthetic and real databases
        ▪ ~1000 items and ~3000 transactions
