Visualisation and exploration of high-dimensional data



  1. Visualisation and exploration of high-dimensional data using a "force directed placement" method: application to the analysis of genomic signatures. Sylvain Lespinats, Alain Giron, Bernard Fertil. Unité INSERM 494, CHU Pitié-Salpêtrière, 91 bd de l'Hôpital, 75634 Paris (France). ASMDA 2005, Brest, 17-20 May.

  2. Topics • FDP-MDS mapping method • Typical examples of mapping • The world of genomic signatures

  3. FDP-MDS goal: high-dimensional data are to be displayed, and the distances between data are available (distance matrix). Build a biased representation of the data in a low-dimensional space (e.g. with a preference for small distances).

  4. FDP-MDS algorithm. Keywords: Multi-Dimensional Scaling, Curvilinear Component Analysis, Force Directed Placement, prototypes. • Step 1: Selection of prototypes • Step 2: Placement of prototypes • Step 3: Space learning

  5. Step 1: Selection of prototypes

  6. Step 1: Selection of prototypes. Choice of "seeds": Hastie T. et al. (2001), "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer Series in Statistics.

  7-13. Step 1: Selection of prototypes (continued; no further slide text)

  14. Step 2: New prototype handling thanks to Force Directed Placement. Each placed prototype (here 1, 2 and 3) acts on the new prototype $N$ through a spring: $F_1 = k_1 (d_{1n} - L_{1n})\,u_{1n}$, $F_2 = k_2 (d_{2n} - L_{2n})\,u_{2n}$, $F_3 = k_3 (d_{3n} - L_{3n})\,u_{3n}$, where $k$ is the spring stiffness, $d$ the distance in output space, $L$ the distance in data space and $u$ a unit vector. $k_i$ depends on $L_{ij}$ in order to favor "some" distances.
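The spring force on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `spring_force` is hypothetical, and the unit vector is assumed to point from the new prototype toward the placed prototype (the slide does not state the orientation), so that a spring stretched beyond its data-space length pulls the two together.

```python
import numpy as np

def spring_force(pos_new, pos_proto, L, k):
    """Force exerted on the new prototype by one placed prototype.

    pos_new, pos_proto: positions in the output space.
    L: distance between the two items in the data space.
    k: spring stiffness.
    Assumption: u points from the new prototype to the placed one.
    """
    pos_new = np.asarray(pos_new, dtype=float)
    pos_proto = np.asarray(pos_proto, dtype=float)
    delta = pos_proto - pos_new
    d = np.linalg.norm(delta)          # distance in the output space
    if d == 0.0:
        return np.zeros_like(delta)    # direction undefined, no force applied
    u = delta / d                      # unit vector
    return k * (d - L) * u             # F = k (d - L) u

# Output distance 5, data distance 2: the spring is stretched and pulls inward.
print(spring_force([0.0, 0.0], [3.0, 4.0], L=2.0, k=1.0))
```

With this sign convention the force vanishes exactly when the output-space distance equals the data-space distance.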

  15. Step 2: New prototype handling thanks to Force Directed Placement. The forces acting on the new prototype $N$ add up: $F = F_1 + F_2 + F_3$. Vectorial sum of forces: movement. Algebraic sum of forces: temperature => additional randomly oriented force.
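One possible reading of this slide is sketched below. The assumptions are mine, not the authors': the "algebraic sum" is taken to be the sum of the force magnitudes, the random force is scaled by that temperature through a hypothetical `noise` factor, and `step` is a hypothetical step size. The function name `move_new_prototype` is also illustrative.

```python
import numpy as np

def move_new_prototype(pos_new, protos, L, k, step=0.1, noise=0.0, rng=None):
    """One displacement of the new prototype under the spring forces.

    protos: (m, dim) positions of placed prototypes in the output space.
    L, k:   (m,) data-space distances and spring stiffnesses.
    Returns the updated position and the temperature.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos_new = np.asarray(pos_new, dtype=float)
    protos = np.asarray(protos, dtype=float)
    L = np.asarray(L, dtype=float)
    k = np.asarray(k, dtype=float)

    delta = protos - pos_new
    d = np.maximum(np.linalg.norm(delta, axis=1), 1e-12)  # avoid division by zero
    u = delta / d[:, None]                                # unit vectors
    magnitudes = k * (d - L)                              # signed spring tensions
    resultant = (magnitudes[:, None] * u).sum(axis=0)     # vector sum: movement
    temperature = np.abs(magnitudes).sum()                # assumed "algebraic sum"

    direction = rng.normal(size=pos_new.shape)            # randomly oriented force
    direction /= np.linalg.norm(direction)
    new_pos = pos_new + step * (resultant + noise * temperature * direction)
    return new_pos, temperature

# Two opposite springs cancel out: no net movement, yet the temperature is not zero.
pos, temp = move_new_prototype([0.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], [0.5, 0.5], [1.0, 1.0])
print(pos, temp)
```

The example shows why a second quantity is useful: a prototype can sit in a balanced but strained configuration, and a non-zero temperature gives it a random kick out of that position.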

  16. Distances between data as a function of dimension. [Figure: histograms of distances between data randomly generated in spaces of dimension 1, 2, 5, 10, 20, 50 and 200.] The relative variation of distances between data decreases as the dimension of the space increases.
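The effect described on this slide can be reproduced with a small experiment. The sketch below assumes points drawn uniformly in the unit hypercube (the slides do not say which distribution was used) and measures the relative variation as the standard deviation of the pairwise distances divided by their mean; `relative_spread` is a hypothetical helper name.

```python
import numpy as np

def relative_spread(dim, n_points=100, seed=0):
    """Std/mean of pairwise Euclidean distances between random points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, dim))          # assumed: uniform in [0, 1]^dim
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))          # full distance matrix
    d = D[np.triu_indices(n_points, k=1)]          # each pair counted once
    return d.std() / d.mean()

# Same dimensions as in the figure legend.
for dim in (1, 2, 5, 10, 20, 50, 200):
    print(dim, round(relative_spread(dim), 3))
```

The printed ratio shrinks as the dimension grows, which is the reason given here for treating short distances preferentially in the mapping.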

  17. Favoring representation of short distances. [Figure: histogram of distances for dim = 200, with colored stiffness curves overlaid.] Colored curves: the stiffness of the springs is a function of the distance between data. • The closer the data, the greater the stiffness • The steepness of the curve sets the neighborhood range. Demartines P. et al. (1997), "Curvilinear Component Analysis: A self-organizing neural network for nonlinear mapping of data sets", IEEE Transactions on Neural Networks 8(1): 148-154, January 1997.
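The slide does not give the formula of the stiffness curves. As a stand-in that has the two stated properties (stiffness decreases with data-space distance, and a steepness parameter sets how sharply the neighborhood is cut off), the sketch below uses a decreasing logistic function. The function name and the parameters `midpoint` and `steepness` are hypothetical.

```python
import numpy as np

def stiffness(L, midpoint, steepness):
    """Spring stiffness as a decreasing function of data-space distance L.

    Assumed logistic form, not taken from the slides:
    close to 1 for L well below midpoint, close to 0 well above it.
    """
    L = np.asarray(L, dtype=float)
    return 1.0 / (1.0 + np.exp(steepness * (L - midpoint)))

distances = np.linspace(0.0, 25.0, 6)
print(stiffness(distances, midpoint=14.0, steepness=1.0))   # gentle transition
print(stiffness(distances, midpoint=14.0, steepness=5.0))   # sharp neighborhood cut
```

A large `steepness` makes the mapping behave almost as if only neighbors closer than `midpoint` were connected by springs, while a small value lets distant points keep some influence.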

  18. Dimension reduction: mapping of a 3-dimensional open box onto a plane

  19. FDP-MDS projection is non-linear. Quality of fit: two 3D boxes with their open sides pointing in different directions are projected into a two-dimensional space.

  20-22. [Figures: mapping examples, each shown with a plot of the distance between points in the projection against the distance between points in the original space.]
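The plots of projected distance against original distance in these examples can be built from two arrays of pairwise distances. The sketch below is a minimal illustration: `distance_pairs` is a hypothetical helper, and the projection used in the example simply drops the third coordinate, as a stand-in for an actual FDP-MDS mapping.

```python
import numpy as np

def distance_pairs(X, Y):
    """Return (original, projected) distances for every pair of points.

    X: (n, D) coordinates in the original space.
    Y: (n, d) coordinates of the same points in the projection.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    i, j = np.triu_indices(len(X), k=1)            # each pair counted once
    original = np.linalg.norm(X[i] - X[j], axis=1)
    projected = np.linalg.norm(Y[i] - Y[j], axis=1)
    return original, projected

# Stand-in projection (not FDP-MDS): keep the first two coordinates only.
X = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 5.0]])
original, projected = distance_pairs(X, X[:, :2])
for o, p in zip(original, projected):
    print(round(o, 3), round(p, 3))
```

Plotting `projected` against `original` gives the kind of diagram shown on these slides: points on the diagonal are pairs whose distance is preserved, and a mapping that favors short distances should stay close to the diagonal near the origin.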

  23. 2-dimensional mapping of cities (non-linear projection). [Figure: map of world cities labeled by name, from Wellington and Auckland to Montevideo and Punta Arenas; scale bar: 5000 km.]

  24. FDP-MDS robustness: noise. [Figure panels: reference; randomly located prototypes.]

  25. FDP-MDS robustness: local minimum. [Figure panels: reference; a prototype is trapped in a local minimum.]

  26. Accounting for neighborhood: distant points can be given more or less weight.

  27. Frequencies of oligonucleotides (words): genomic signature. [Figure: signature images for 2-, 3-, 4- and 6-letter words, with a color scale from low frequency to high frequency; cells are labeled by word, e.g. A, C, G, T, AA, CC, GC, CCC.]
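A genomic signature in the sense of this slide is a table of word frequencies. The sketch below counts overlapping words of a given length on a single strand and normalizes the counts; the function name `word_frequencies` is hypothetical, and any further processing used by the authors (for example handling of the complementary strand) is not stated in the slides and is not reproduced here.

```python
from collections import Counter
from itertools import product

def word_frequencies(sequence, word_length):
    """Relative frequency of every DNA word of the given length.

    Overlapping words are counted along one strand; words containing
    letters other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    words = ["".join(p) for p in product("ACGT", repeat=word_length)]
    counts = Counter(
        seq[i:i + word_length] for i in range(len(seq) - word_length + 1)
    )
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}

signature = word_frequencies("ACGTACGGTTAACC", 2)
print(len(signature))       # one entry per possible 2-letter word
print(signature["AC"])
```

For words of length n the signature has 4^n entries, so signatures of 2-letter to 6-letter words range from 16 to 4096 dimensions, which is the kind of high-dimensional data the mapping method is applied to.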

  28. A. fulgidus, B. subtilis, C. elegans, C. jejuni, T. pallidum, D. radiodurans, E. coli, H. pylori, H. influenzae, Synechocystis sp., H. sapiens, M. jannaschii, P. horikoshii, R. prowazekii, S. cerevisiae

  29. Genomic signatures mapping by Principal Component Analysis. [Figure: axes are the 1st component and the 2nd component; one group of points is labeled VIRUS.]

  30. Genomic signatures mapping: 2D FDP-MDS mapping of the tree of life (Axis 1, Axis 2). Each point is associated with a species; color depends on taxonomy. Labeled groups: Bacteria, Proteobacteria, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Firmicutes, Clostridia, Clostridiales, Lactobacillales, Bacillales, Mollicutes, Bacteroidetes, Cyanobacteria, Oscillatoriales, Chroococcales, Nostocales, Actinobacteria (Actinobacteridae; Actinomycetales), Chlamydiae (Chlamydiales), Fusobacteria (Fusobacterales; Fusobacteriaceae), Deinococcus-Thermus (Deinococci), Thermotogae (Thermotogales; Thermotogaceae), Spirochaetes (Spirochaetales); Archaea, Euryarchaeota, Crenarchaeota, Thermoprotei; Eukaryota, Alveolata, Fungi, Microsporidia, Basidiomycota, Ascomycota, Zygomycota (Zygomycetes; Mucorales; Mucoraceae), Euglenozoa, Viridiplantae, Rhodophyta, Streptophyta, Chlorophyta, Parabasalidea (Trichomonadida; Trichomonadidae; Tritrichomonadinae), stramenopiles, Metazoa, Chordata (Craniata; Vertebrata; Euteleostomi), Nematoda (Chromadorea), Mollusca, Arthropoda. Principal Component Analysis: PCA poorly displays taxonomic groups.
