
NEW ADVANCES IN SYMBOLIC DATA ANALYSIS and SPATIAL CLASSIFICATION

NEW ADVANCES IN SYMBOLIC DATA ANALYSIS and SPATIAL CLASSIFICATION. Edwin Diday Paris Dauphine University Remembering Suzanne WINSBERG This talk is dedicated to her OUTLINE PART 1: SYMBOLIC DATA ANALYSIS The two levels of


  1. NEW ADVANCES IN SYMBOLIC DATA ANALYSIS and SPATIAL CLASSIFICATION. Edwin Diday Paris Dauphine University

  2. Remembering Suzanne WINSBERG This talk is dedicated to her…

  3. OUTLINE PART 1: SYMBOLIC DATA ANALYSIS • The two levels of statistical units: individuals, concepts • What are Symbolic Data? • What is Symbolic data analysis? • Why and when Symbolic Data Analysis? • Future of SDA PART 2: SPATIAL CLASSIFICATION Symbolic Data Analysis software: SODAS and SYR

  4. THE TWO LEVELS OF STATISTICAL UNITS: • INDIVIDUALS • CONCEPTS

  5. [Diagram] Modeling from the REAL WORLD to the MODELED WORLD. Standard Data Analysis: each individual w of Ω is modeled by a description d_w in the INDIVIDUALS SPACE (mapping Y). Symbolic Data Analysis: each concept C, with extent EXT(C/Ω), is modeled by a description d_C in the CONCEPTS SPACE (mapping T), with extent EXT(d_C/Ω).

  6. BASIC IDEAS OF SDA • TWO LEVELS OF OBJECTS: o First level: individuals o Second level: categories, classes or concepts (intent, extent) • SECOND-LEVEL UNITS CAN BE CONSIDERED AS NEW STATISTICAL UNITS. • A CONCEPT IS DESCRIBED BY THE VARIATION OF THE CLASS OF INDIVIDUALS THAT IT REPRESENTS: THIS PRODUCES SYMBOLIC DATA.

  7. FROM INDIVIDUALS TO CONCEPTS

     | Classical: individuals | Symbolic: concepts |
     |---|---|
     | Birds | Species of birds |
     | Inhabitants | Regions |
     | Players (Zidane, …) | Team (Marseille, …) |
     | Image | Type of image (sunset, …) |
     | Sold clothes | Shops |
     | Users | Trace of Web usage |
     | Patients after heart attack | Trajectory of patients in hospitals |
     | Mobile users | Consuming level |

  8. • WHAT ARE SYMBOLIC DATA?

  9. SYMBOLIC DATA

     | TEAM OF THE FRENCH CUP | WEIGHT | NATIONALITY | NB OF GOALS |
     |---|---|---|---|
     | DIJON | [75, 89] | {French} | {0.8 (0), 0.2 (1)} |
     | LYON | [80, 95] | {Fr, Alg, Arg} | {0.1 (0), 0.3 (1), …} |
     | PARIS-ST G. | [76, 95] | {Fr, Tun} | {0.4 (0), 0.2 (1), …} |
     | NANTES | [70, 85] | {Fr, Engl, Arg} | {0.2 (0), 0.5 (1), …} |

     Here the variation (of weight, nationality, …) concerns the players of each team. THESE NEW KINDS OF VARIABLES ARE CALLED « SYMBOLIC » BECAUSE THEY ARE NOT PURELY NUMERICAL: THEY EXPRESS THE INTERNAL VARIATION INSIDE EACH CONCEPT.
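
A minimal sketch of how such a symbolic row could be built by aggregating individuals. The player records below are hypothetical (the slide only shows the aggregated result); they are chosen so that the DIJON description matches the slide. The function name `describe_team` and the dictionary keys are illustrative, not part of SODAS or SYR.

```python
from collections import Counter

# Hypothetical player records: (team, weight in kg, nationality, goals scored).
players = [
    ("DIJON", 75, "French", 0),
    ("DIJON", 89, "French", 0),
    ("DIJON", 80, "French", 1),
    ("DIJON", 78, "French", 0),
    ("DIJON", 84, "French", 0),
    ("LYON", 80, "Fr", 0),
    ("LYON", 95, "Alg", 1),
    ("LYON", 88, "Arg", 2),
]


def describe_team(team, records):
    """Aggregate the players of one team into a symbolic description."""
    rows = [r for r in records if r[0] == team]
    weights = [r[1] for r in rows]
    goal_counts = Counter(r[3] for r in rows)
    return {
        # Interval-valued variable: [min, max] of the players' weights.
        "weight": (min(weights), max(weights)),
        # Set-valued variable: nationalities present in the team.
        "nationality": {r[2] for r in rows},
        # Modal variable: relative frequency of each number of goals.
        "goals": {g: c / len(rows) for g, c in sorted(goal_counts.items())},
    }


dijon = describe_team("DIJON", players)
print(dijon)
```

Each team becomes one row whose cells keep the variation among its players instead of a single number.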

  10. How to conserve correlation and explain it?

      | Patients | Region | Cardiology Expenses | Dentistry Expenses | Town | Insurance |
      |---|---|---|---|---|---|
      | i1 | R1 | 12.5 | 3.5 | Lyon1 | Type 3 |
      | i2 | R1 | 9.6 | 2.1 | Paris3 | Type 2 |
      | i3 | R1 | 11.4 | 6.5 | Lyon1 | Type 4 |
      | i4 | R2 | 3.2 | 1.6 | Paris1 | Type 1 |
      | i5 | R2 | 7.1 | 4.8 | Lyon2 | Type 2 |

      | Concept | Card. Expenses | Dentistry Exp. | Town | Insurance | Cor(card, dentist) |
      |---|---|---|---|---|---|
      | R1 | [9.6, 12.5] | [2.1, 6.5] | {Lyon1, Paris 3} | histogram (types 1 to 3) | Cor_R1(cardi, dent) |
      | R2 | [3.2, 7.1] | [1.6, 4.8] | Paris 3 | histogram (types 1 to 4) | Cor_R2(cardi, dent) |
      | R3 | [9.2, 10.1] | [6.2, 8.1] | Pau 1 | | Cor_R3(cardi, dent) |
      | R4 | [5, 8.4] | [7.3, 9.4] | Pau 4 | | Cor_R4(cardi, dent) |

      Then, a symbolic regression or symbolic decision tree can explain the correlation.
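
As an illustration of keeping the correlation as an extra column of the concept table, the sketch below aggregates the five patient rows shown on the slide into one description per region. Only regions R1 and R2 can be computed, since the slide lists no patients for R3 and R4. The helper names (`pearson`, `describe_region`) are assumptions made for this example, and plain Pearson correlation is used as one possible choice.

```python
from math import sqrt

# Patient rows from the slide: (patient, region, cardiology, dentistry).
patients = [
    ("i1", "R1", 12.5, 3.5),
    ("i2", "R1", 9.6, 2.1),
    ("i3", "R1", 11.4, 6.5),
    ("i4", "R2", 3.2, 1.6),
    ("i5", "R2", 7.1, 4.8),
]


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def describe_region(region):
    """Intervals of both expenses plus their within-region correlation."""
    card = [p[2] for p in patients if p[1] == region]
    dent = [p[3] for p in patients if p[1] == region]
    return {
        "card": (min(card), max(card)),
        "dent": (min(dent), max(dent)),
        "cor": pearson(card, dent),
    }


table = {region: describe_region(region) for region in ("R1", "R2")}
for region, description in table.items():
    print(region, description)
```

The intervals alone would lose the link between the two expenses; the added `cor` value keeps it available for a later regression or decision tree on the concepts.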

  11. Symbolic Data Table (SYR software)

  12. How to model concepts? • By the so-called “Symbolic Objects”

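  13. SYMBOLIC OBJECT S = (a, R, d_C), with a(w) = [y(w) R d_C]. [Diagram: the description y(w) of an individual w is compared with d_C through the relation R; example: « It’s an animal »(w) = 0.99, i.e. yes.]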

  14. TWO KINDS OF SYMBOLIC OBJECTS. BOOLEAN SYMBOLIC OBJECTS: S = (a, R, d1), d1 = {12, 20, 28} × {employee, worker}, R = (⊆, ⊆), a(w) = [age(w) ⊆ {12, 20, 28}] ∧ [SPC(w) ⊆ {employee, worker}], a(w) ∈ {TRUE, FALSE}.

  15. THE MEMBERSHIP FUNCTION « a », MODAL CASE. S = (a, R, d): a(w) = [age(w) R_1 {(0.2) 12, (0.8) [20, 28]}] ∧ [SPC(w) R_2 {(0.4) employee, (0.6) worker}], a(w) ∈ [0, 1]. First approach: simple or flexible matching, R = (R_1, R_2): r R_i q = Σ_{j=1,k} r_j q_j e^(r_j − min(r_j, q_j)). Second approach: probabilistic: if dependencies, copulas,
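
The flexible matching of the first approach can be sketched for a single variable as follows, assuming the formula reads r R_i q = Σ_j r_j q_j exp(r_j − min(r_j, q_j)), where r and q are weight vectors over the same k categories. The function name `flexible_match` is hypothetical, and how the per-variable scores are combined by ∧ is not specified on the slide, so it is left out.

```python
from math import exp


def flexible_match(r, q):
    """Sum over categories of r_j * q_j * exp(r_j - min(r_j, q_j))."""
    return sum(rj * qj * exp(rj - min(rj, qj)) for rj, qj in zip(r, q))


# Weights of (employee, worker) in the description d, as on the slide.
q_spc = [0.4, 0.6]

# An individual described by the same distribution: every exponent is 0.
same = flexible_match(q_spc, q_spc)

# An individual that is a worker with certainty (hypothetical example).
worker = flexible_match([0.0, 1.0], q_spc)

print(same, worker)
```

When r and q coincide the score reduces to Σ_j q_j², and the exponential term raises the score of categories where the individual's weight exceeds that of the description.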

  16. QUALITY CONTROL, CONFIRMATORY SDA. [Diagram: in the REAL WORLD, INDIVIDUALS w of Ω and CONCEPTS; in the MODELED WORLD, DESCRIPTIONS d_w and d_C and SYMBOLIC OBJECTS s = (a, R, d_C); the mapping T and the extent Ext(s/Ω) link the two worlds.]

  17. Lattice obtained from the Symbolic Data Table

      THE SYMBOLIC DATA TABLE

      | | Y1 | Y2 | Y3 |
      |---|---|---|---|
      | W1 | {a, b} | ∅ | {g} |
      | W2 | ∅ | ∅ | {g, h} |
      | W3 | {c} | {e, f} | {g, h, i} |
      | W4 | {a, b, c} | {e} | {h} |

      Symbolic objects obtained from the Symbolic Lattice:
      s_2: a_2(w) = [y_2(w) ⊆ {e}] ∧ [y_3(w) ⊆ {g, h}], Ext(s_2) = {1, 2, 4}
      s_3: a_3(w) = [y_1(w) ⊆ {c}], Ext(s_3) = {2, 3}
      s_4: a_4(w) = [y_1(w) ⊆ {a, b}] ∧ [y_2(w) = ∅] ∧ [y_3(w) ⊆ {g, h}], Ext(s_4) = {1, 2}
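
The extents above can be checked mechanically. The sketch below assumes the table reads W1 = ({a, b}, ∅, {g}), W2 = (∅, ∅, {g, h}), W3 = ({c}, {e, f}, {g, h, i}), W4 = ({a, b, c}, {e}, {h}), which is the reading consistent with the three extents given on the slide. The function name `extent` is illustrative.

```python
# Symbolic data table from the slide; set() stands for the empty set.
table = {
    1: {"y1": {"a", "b"}, "y2": set(), "y3": {"g"}},
    2: {"y1": set(), "y2": set(), "y3": {"g", "h"}},
    3: {"y1": {"c"}, "y2": {"e", "f"}, "y3": {"g", "h", "i"}},
    4: {"y1": {"a", "b", "c"}, "y2": {"e"}, "y3": {"h"}},
}


def extent(conditions):
    """Rows w for which y(w) is a subset of d for every (variable, d) pair."""
    return {
        w
        for w, row in table.items()
        if all(row[var] <= d for var, d in conditions.items())
    }


ext_s2 = extent({"y2": {"e"}, "y3": {"g", "h"}})
ext_s3 = extent({"y1": {"c"}})
# y2(w) = ∅ is the same condition as y2(w) ⊆ ∅.
ext_s4 = extent({"y1": {"a", "b"}, "y2": set(), "y3": {"g", "h"}})

print(ext_s2, ext_s3, ext_s4)
```

Note that a row with an empty cell satisfies every inclusion on that variable, which is why W2 belongs to all three extents.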

  18. • WHAT IS SYMBOLIC DATA ANALYSIS? TO EXTEND STATISTICS AND DATA MINING TO SYMBOLIC DATA TABLES DESCRIBING HIGHER LEVEL UNITS NEEDING VARIATION IN THEIR DESCRIPTION.

  19. SYMBOLIC DATA ANALYSIS TOOLS HAVE BEEN DEVELOPED - Graphical visualisation of Symbolic Data - Correlation, Mean, Mean Square, Histogram of a symbolic variable - Dissimilarities between symbolic descriptions - Clustering of symbolic descriptions - S-Kohonen Mappings - S-Decision Trees - S-Principal Component Analysis - S-Discriminant Factorial Analysis - S-Regression - Etc...

  20. Why Symbolic Data Analysis? 1) From standard statistical units to concepts, the statistic is not the same! 2) Symbolic Data cannot be reduced to classical data!

  21. From standard statistical units to concepts, the statistic is not the same! On an island: 600 birds of three species together: 400 swallows, 100 ostriches, 100 penguins.

      Classical data table (the birds are the individuals):

      | Bird | Species | Flying | Size (cm) |
      |---|---|---|---|
      | 1 | penguin | No | 80 |
      | 2 | swallow | Yes | 30 |
      | … | … | … | … |
      | 600 | ostrich | No | 125 |

      Symbolic Data Table (the species are the new units):

      | Species | Fly | Colour | Size | Migr |
      |---|---|---|---|---|
      | swallows | Yes | {0.3 b, 0.7 grey} | [25, 35] | Yes |
      | ostriches | No | {0.1 black, 0.9 g} | [85, 160] | No |
      | penguins | No | {0.5 b, 0.5 grey} | [70, 95] | Yes |

      Swallows, ostriches and penguins are the “concepts”. The variation due to the individuals of each species produces symbolic data. « Migration » is a variable added at the « concepts » level.

      [Bar charts] Frequencies of individuals: Flying 400, Not Flying 200. Frequencies of concepts (species): Flying 1, Not Flying 2.
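
The change of statistic can be reproduced with a few lines, using only the counts given on the slide (400 swallows, 100 ostriches, 100 penguins, of which only swallows fly). Variable names are illustrative.

```python
from collections import Counter

# Species-level facts from the slide.
flies = {"swallow": True, "ostrich": False, "penguin": False}
counts = {"swallow": 400, "ostrich": 100, "penguin": 100}

# Individual level: one entry per bird on the island.
birds = [species for species, n in counts.items() for _ in range(n)]
by_individual = Counter(flies[species] for species in birds)

# Concept level: one entry per species.
by_concept = Counter(flies[species] for species in counts)

print("individuals:", dict(by_individual))
print("concepts:   ", dict(by_concept))
```

Most individuals fly (400 against 200), while most concepts do not (1 against 2), so the same variable gives opposite majorities depending on the chosen unit.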

  22. WHY CAN SYMBOLIC DATA NOT BE REDUCED TO CLASSICAL DATA?

      Symbolic Data Table:

      | Players category | Weight | Size | Nationality |
      |---|---|---|---|
      | Very good | [80, 95] | [1.70, 1.95] | {0.7 Eur, 0.3 Afr} |

      Transformation into classical data:

      | Players category | Weight Min | Weight Max | Size Min | Size Max | Eur | Afr |
      |---|---|---|---|---|---|---|
      | Very good | 80 | 95 | 1.70 | 1.95 | 0.7 | 0.3 |

      Concern: The initial variables are lost and the variation is lost!
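
A small sketch of the flattening described above, to make the concern concrete. The representation (tuples for intervals, dictionaries for modal variables) and the column names such as `weight_min` are assumptions made for this example.

```python
# Symbolic description of the "Very good" players category, from the slide.
symbolic = {
    "weight": (80, 95),
    "size": (1.70, 1.95),
    "nationality": {"Eur": 0.7, "Afr": 0.3},
}


def to_classical(description):
    """Flatten intervals into min/max columns and modal values into one column per category."""
    row = {}
    for name, value in description.items():
        if isinstance(value, tuple):
            row[f"{name}_min"], row[f"{name}_max"] = value
        else:
            for category, weight in value.items():
                row[category] = weight
    return row


classical = to_classical(symbolic)
print(classical)
```

After flattening, nothing in the row says that `weight_min` and `weight_max` bound the same variable, or that `Eur` and `Afr` are weights of one distribution: a classical method treats the six columns as unrelated numbers.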

  23. Divisive Clustering or Decision tree [Figure panels: Symbolic Analysis, Classical Analysis; variable labels: Weight, Max Weight]

  24. Classical / symbolic: a comparison. Decision trees built on 1000 initial data points (patients) that are to be grouped into homogeneous classes following the same hospitalization trajectory. One variable to explain (e.g. mortality) and clinical-biological explanatory variables. “Classical” tree on the patients; “symbolic” tree on the trajectories. [Figure labels: explanatory variables; variable to explain (the mortality level)] With fewer branches, fewer nodes and better discrimination, the symbolic tree yields classes of patients that are more homogeneous and clearly explained with respect to the « mortality » variable.

  25. Symbolic Principal Component Analysis; Symbolic correlation [figures not preserved]

  26. Symbolic Analysis Classical Analysis

  27. WHEN SYMBOLIC DATA ANALYSIS? • When the right units are the concepts: finding why a team is a winner is not finding why a player is a winner • When the categories of the class variable to explain are considered as new units and described by explanatory symbolic variables. • When the initial data are composed of multisource data tables and their fusion is needed

  28. FRANCE IS DIVIDED INTO 50 000 COUNTIES CALLED IRIS. IRIS are the level to study; initial data are confidential and multisource.

      Classical data table (schools):

      | School | IRIS | Type |
      |---|---|---|
      | Condorcet | IRIS 605 | Private |
      | Laplace | IRIS 75 | Public |
      | Voltaire | IRIS 855 | Public |

      Classical data table (households):

      | Household | Size | Car make | SPC | IRIS |
      |---|---|---|---|---|
      | Dupont | 2 | Renault | 3 | IRIS 55 |
      | Durand | 5 | Renault | 1 | IRIS 602 |
      | Boule | 3 | Peugeot | 2 | IRIS 498 |

      Symbolic description of London by the school variables:

      | IRIS | Status | Specialization |
      |---|---|---|
      | IRIS 1 | {(private, 37%); (public, 63%)} | {(yes, 17%); (no, 83%)} |

      Symbolic description of London by the household variables (columns: IRIS, Size, Localisation, SPC): IRIS 1, [0, 5], Renault (43%), Citroën (21%) ….

      Concatenation: IRIS n = [Symb. Description of households] ∧ [
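
The fusion step can be sketched as two independent aggregations that are then concatenated on the IRIS key. All rows below are hypothetical (the slide's sample rows do not share an IRIS), and the helper names `shares` and `group` are assumptions for this example.

```python
from collections import Counter, defaultdict

# Hypothetical source tables keyed by IRIS.
schools = [  # (school, iris, type)
    ("S1", "IRIS 1", "private"),
    ("S2", "IRIS 1", "public"),
    ("S3", "IRIS 1", "public"),
    ("S4", "IRIS 2", "public"),
]
households = [  # (household, size, car make, iris)
    ("H1", 2, "Renault", "IRIS 1"),
    ("H2", 5, "Renault", "IRIS 1"),
    ("H3", 3, "Peugeot", "IRIS 2"),
]


def shares(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def group(rows, key_index):
    """Group rows by the value found at position key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups


school_desc = {
    iris: {"school_type": shares(r[2] for r in rows)}
    for iris, rows in group(schools, 1).items()
}
household_desc = {
    iris: {
        "size": (min(r[1] for r in rows), max(r[1] for r in rows)),
        "car": shares(r[2] for r in rows),
    }
    for iris, rows in group(households, 3).items()
}

# Concatenation: one symbolic row per IRIS, built from both sources.
fused = {
    iris: {**school_desc.get(iris, {}), **household_desc.get(iris, {})}
    for iris in sorted(set(school_desc) | set(household_desc))
}
print(fused)
```

Only aggregated descriptions leave each source, which is one way to read the slide's remark that the initial data are confidential.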

  29. Adding Data to a SYMBOLIC FILE. Example: Social Security Insurance. Actions table (315,000 rows): Action, Provider, Medical act (dental, optical, …), Refund rate. Providers table (19,000 rows): Provider, Gender, Age, Medical act, Refund rate.
