Generalized similarity measures for text data




  1. Generalized similarity measures for text data. Hubert Wagner (IST Austria) Joint work with Herbert Edelsbrunner GETCO 2015, Aalborg April 9, 2015

  2. Plan ◮ Shape of data. ◮ Text as a point-cloud. ◮ Log-transform and similarity measure. ◮ Bregman divergence and topology.

  3. Shape of data.

  4. Main tools. Rips and Čech simplicial complexes: ◮ Capture the shape of the union of balls. ◮ Combinatorial representation. Persistence captures geometric-topological information of the data: ◮ Key property: stability!

  5. Interpretation of filtration values. For a simplex $S = \{v_0, \dots, v_k\}$, $f(S) = t$ means that at filtration threshold $t$, the objects $v_0, \dots, v_k$ are considered close.

  6. Text as a point-cloud.

  7. Basic concepts Corpus: ◮ (Large) collection of text documents. Term-vector: ◮ Weighted vector of key-words or terms . ◮ Summarizes the topic of a single document. ◮ Higher weight means higher importance .

  8. Concept: Vector Space Model ◮ The Vector Space Model maps a corpus K to $\mathbb{R}^d$. ◮ Each distinct term in K becomes a direction, so d can be high (tens of thousands). ◮ Each document is represented by its term-vector. [Figure: two term-vectors plotted against the Cat, Dog, Donkey axes: <(Cat,0.5), (Donkey,0.5)> and <(Cat,0), (Dog,0.2), (Donkey,0.9)>.]
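The mapping above can be sketched in a few lines of Python. The weighting scheme is an assumption (the slide does not specify one): here weights are term frequencies normalized to sum to 1, and the function name `term_vectors` is mine.

```python
from collections import Counter

def term_vectors(corpus):
    """Map each document (a string) to a term-vector over the corpus
    vocabulary. Weights are term frequencies normalized to sum to 1
    (an assumption; tf-idf is an equally common choice)."""
    vocab = sorted({t for doc in corpus for t in doc.split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.split())
        total = sum(counts.values())
        vectors.append([counts[t] / total for t in vocab])
    return vocab, vectors
```

With the toy corpus ["cat donkey", "dog donkey donkey"], the vocabulary becomes (cat, dog, donkey) and the first document maps to (0.5, 0, 0.5), matching the slide's example vector.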

  9. Concept: Similarity measures ◮ Cosine similarity compares two documents. ◮ Distance (dissimilarity): $d(a, b) := 1 - \mathrm{sim}(a, b)$. ◮ This d is not a metric. [Figure: the same two term-vectors against the Cat, Dog, Donkey axes.]
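The two bullets above can be made concrete; a minimal sketch (function names are mine, not from the talk):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two term-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def dissim(a, b):
    """Dissimilarity d(a, b) := 1 - sim(a, b).
    Symmetric and zero on identical vectors, but not a metric:
    the triangle inequality can fail."""
    return 1.0 - cosine_sim(a, b)
```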

  10. Geometry-topological tools.

  11. Interpreting Rips A simplex is added immediately after its boundary: ◮ $d(a, b)$ – the dissimilarity. ◮ For a triangle, $d(a, b, c) = \max(d(a, b), d(a, c), d(b, c))$. ◮ Is this the filtering function we want?
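The Rips filtration value described above (a simplex enters as soon as its most dissimilar vertex pair does) can be sketched as:

```python
from itertools import combinations

def rips_value(simplex, d):
    """Filtration value of a simplex in the Rips complex: the maximum
    of d over all pairs of vertices (0 for a single vertex)."""
    if len(simplex) < 2:
        return 0.0
    return max(d(a, b) for a, b in combinations(simplex, 2))
```

For a triangle {a, b, c} this is exactly max(d(a, b), d(a, c), d(b, c)) from the slide.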

  12. Generalized similarity Goal: ◮ Extend similarity from pairs to larger subsets of documents. ◮ Its persistence should be stable. ◮ As a bonus, the resulting complex will be smaller. [Figure: two term-vectors against the A, G, T axes: [(A,0.5), (G,0), (T,0.5)] and [(A,0), (G,0.2), (T,0.9)].]

  13. Simple example. For simplicity, let us work with binary term-vectors (or sets of terms). ◮ $\mathrm{sim}_J(X_1, \dots, X_d) = \frac{\mathrm{card}(\bigcap_i X_i)}{\mathrm{card}(\bigcup_i X_i)}$. ◮ Generalizes the Jaccard index.

           cat  dog  donkey
      X_1   1    1     0
      X_2   0    1     1
      X_3   1    0     1
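With documents as sets of terms, this generalized Jaccard similarity is directly computable; a sketch (the function name is mine):

```python
def sim_jaccard(*term_sets):
    """Generalized Jaccard similarity over any number of term sets:
    |intersection of all sets| / |union of all sets|."""
    inter = set.intersection(*term_sets)
    union = set.union(*term_sets)
    return len(inter) / len(union)
```

On the three binary rows of the table, the triple similarity is 0 (no term occurs in all three documents), while the first two documents alone share "dog", giving 1/3.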

  14. New direction. Flawed generalized cosine measure: $R_{\cos}(p^0, p^1, \dots, p^k) = \sum_{j=1}^{n} \prod_{i=0}^{k} p^i_j$. (1) Another option: the length of the geometric mean: $R_{gm}(p^0, p^1, \dots, p^k) = \Big( \sum_{j=1}^{n} \big( \prod_{i=0}^{k} p^i_j \big)^{2/(k+1)} \Big)^{1/2}$. (2)
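Both measures, (1) and (2), in code; a sketch assuming each point is an equal-length list of coordinates (function names are mine):

```python
import math

def r_cos(points):
    """Flawed generalized cosine (1): sum over coordinates j of the
    product of the j-th entries of all k+1 points."""
    n = len(points[0])
    return sum(math.prod(p[j] for p in points) for j in range(n))

def r_gm(points):
    """Length of the geometric mean (2): Euclidean norm of the
    coordinate-wise geometric mean of the k+1 points."""
    n = len(points[0])
    k1 = len(points)  # k + 1
    return math.sqrt(sum(math.prod(p[j] for p in points) ** (2 / k1)
                         for j in range(n)))
```

For two points r_cos reduces to the ordinary dot product, and for a single point r_gm reduces to its Euclidean norm.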

  15. Log-transform We study the n-dimensional log-transform and related distances.

  16. Log-transform

  17. Log-transform in 3D

  18. Log-distance

  19. Log-distance: formula Let $x, y \in \mathbb{R}^{n-1}$, $s = (x, F_1(x))$ and $t = (y, F_1(y))$. Then the log-distance from x to y is $D(x, y) = \sum_{j=1}^{n} (t_j - s_j)\, e^{2 t_j}$.

  20. Log-distance: conjugate [Figure: points x, y and their conjugates x*, y*.]

  21. Log-distance: conjugate in 3D

  22. Log Ball

  23. Log Čech complex $\mathrm{Cech}_r(X) = \{\, \xi \subseteq X \mid \bigcap_{x \in \xi} B_r(x) \neq \emptyset \,\}$. (3)

  24. Generalized measure. For each simplex $\xi \in \Delta(X)$, there is a smallest radius for which $\xi$ belongs to the Čech complex: $r_C(\xi) = \min \{ r \mid \xi \in \mathrm{Cech}_r(X) \}$. (4) We call $r_C : \Delta(X) \to \mathbb{R}$ the Čech radius function of X. In the original coordinate space, we get the desired similarity measure: $R_C(\xi) = e^{-r_C(\xi)/\sqrt{n}}$. (5)
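For Euclidean balls, the smallest r in (4) is the radius of the minimum enclosing ball of the simplex's vertices. A brute-force sketch for simplices with up to three vertices (the general case needs a miniball algorithm such as Welzl's; function names are mine):

```python
import math

def cech_radius(pts):
    """Smallest r such that the balls B_r(v) around up to 3 vertices
    have a common point = radius of their minimum enclosing ball."""
    pts = [tuple(p) for p in pts]
    if len(pts) == 1:
        return 0.0
    if len(pts) == 2:
        return math.dist(pts[0], pts[1]) / 2.0
    a, b, c = pts
    edges = sorted((math.dist(p, q), p, q)
                   for p, q in [(a, b), (b, c), (a, c)])
    longest, p, q = edges[-1]
    other = next(v for v in pts if v not in (p, q))
    mid = tuple((pi + qi) / 2 for pi, qi in zip(p, q))
    # Obtuse or right triangle: the longest edge's midpoint ball suffices.
    if math.dist(mid, other) <= longest / 2:
        return longest / 2
    # Acute triangle: circumradius R = abc / (4 * area), area via Heron.
    ab, bc, ca = (e[0] for e in edges)
    s = (ab + bc + ca) / 2
    area = math.sqrt(s * (s - ab) * (s - bc) * (s - ca))
    return ab * bc * ca / (4 * area)

def similarity_measure(pts, n):
    """The measure R_C from (5): exp(-r_C / sqrt(n))."""
    return math.exp(-cech_radius(pts) / math.sqrt(n))
```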

  25. Bregman divergences

  26. Bregman divergences Bregman distance from x to y: $D_F(x, y) = F(x) - [F(y) + \langle \nabla F(y), x - y \rangle]$. (6)

  27. Bregman divergences F can be any strictly convex function! ◮ It covers the squared Euclidean distance, squared Mahalanobis distance, Kullback-Leibler divergence, and Itakura-Saito distance. ◮ Extensively used in machine learning. ◮ Links to statistics via the [regular] exponential family (of distributions).
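The definition $D_F(x, y) = F(x) - F(y) - \langle \nabla F(y), x - y \rangle$ with two of the F's listed above, as a sketch (names are mine):

```python
import math

def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_F(y), x, y))
    return F(x) - F(y) - inner

# F(x) = ||x||^2 recovers the squared Euclidean distance.
sq_norm = lambda x: sum(v * v for v in x)
grad_sq_norm = lambda x: [2 * v for v in x]

# F(x) = sum_i x_i log x_i (negative entropy) recovers the
# (generalized) Kullback-Leibler divergence.
neg_entropy = lambda x: sum(v * math.log(v) for v in x)
grad_neg_entropy = lambda x: [math.log(v) + 1 for v in x]
```

Note the asymmetry: in general $D_F(x, y) \neq D_F(y, x)$, which is why these are divergences rather than metrics.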

  28. Further connections ◮ Bregman-based Voronoi diagrams [Nielsen et al.]. ◮ Information Geometry. ◮ Collapsibility Čech → Delaunay [Bauer, Edelsbrunner]. ◮ Persistence stability for geometric complexes [Chazal, de Silva, Oudot].

  29. Summary ◮ A new, stable, and relevant distance (dissimilarity measure) for texts. ◮ It serves as an interpretation of text data. ◮ A link between TDA and Bregman divergences.

  30. Thank you! Research partially supported by the TOPOSYS project
