Visualisation and exploration of high-dimensional data using a "force directed placement" method: application to the analysis of genomic signatures. Sylvain Lespinats, Alain Giron, Bernard Fertil. Unité INSERM 494, CHU Pitié-Salpêtrière, 91 bd de l'Hôpital, 75634 Paris (France). ASMDA 2005, Brest, 17-20 May.
Topics • FDP-MDS mapping method • Typical examples of mapping • The world of genomic signatures
FDP-MDS goal: high-dimensional data are to be displayed, and the distances between data are available (distance matrix). Build a biased data representation in a low-dimensional space (e.g. with a preference for small distances).
FDP-MDS algorithm. Keywords: Multi-Dimensional Scaling, Curvilinear Component Analysis, Force Directed Placement, prototypes • Step 1: selection of prototypes • Step 2: placement of prototypes • Step 3: space learning
Step 1: selection of prototypes. Choice of « seeds »: Hastie T. et al. (2001), « The Elements of Statistical Learning: Data Mining, Inference, and Prediction », Springer Series in Statistics.
Step 2: new prototype handling thanks to Force Directed Placement. Springs connect the new point N to prototypes 1, 2 and 3: $F_1 = k_1 (d_{1n} - L_{1n})\, u_{1n}$, $F_2 = k_2 (d_{2n} - L_{2n})\, u_{2n}$, $F_3 = k_3 (d_{3n} - L_{3n})\, u_{3n}$, where k: spring stiffness; d: distance in output space; L: distance in data space; u: unit vector. $k_i$ depends on $L_{ij}$ in order to favor “some” distances.
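The spring force above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `spring_force` and its arguments are hypothetical, and the unit vector u is assumed to point from the new point towards the prototype, so that a spring stretched beyond its data-space length (d > L) pulls the point closer.

```python
import math

def spring_force(new_pos, proto_pos, data_dist, stiffness):
    """Force F = k (d - L) u exerted on the new point by one prototype.

    new_pos, proto_pos: coordinates in the output (low-dimensional) space.
    data_dist: L, the distance between the two items in data space.
    stiffness: k, the spring stiffness.
    """
    delta = [p - q for p, q in zip(proto_pos, new_pos)]
    d = math.sqrt(sum(c * c for c in delta))  # d: distance in output space
    if d == 0.0:
        # Direction undefined when the points coincide; assumption: no force.
        return [0.0] * len(new_pos)
    u = [c / d for c in delta]  # assumed orientation: new point -> prototype
    magnitude = stiffness * (d - data_dist)
    return [magnitude * c for c in u]

# Example: prototype at distance 5 in output space, 3 in data space.
force = spring_force((0.0, 0.0), (3.0, 4.0), data_dist=3.0, stiffness=2.0)
```

With this sign convention the force vanishes when the output distance equals the data distance, and it pushes the point away when the two are too close in the output space.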
Step 2: new prototype handling thanks to Force Directed Placement. $F = F_1 + F_2 + F_3$. Algebraic sum of forces: temperature => additional randomly oriented force. Vectorial sum of forces: movement.
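One possible reading of this slide, sketched in Python for a 2D output space: the vector sum of the spring forces moves the point, while a "temperature" derived from the forces adds a randomly oriented perturbation. The slide does not give the exact rule, so several choices here are assumptions: the temperature is taken as the sum of the force magnitudes, and the names `step_2d`, `step` and `noise` are hypothetical.

```python
import math
import random

def step_2d(pos, forces, step=0.1, noise=0.05, rng=random):
    """Move a 2D point once under a list of 2D forces.

    Vector sum of forces -> deterministic movement.
    Sum of force magnitudes (assumed "temperature") -> size of an
    additional force with a uniformly random orientation.
    """
    fx = sum(f[0] for f in forces)
    fy = sum(f[1] for f in forces)
    temperature = sum(math.hypot(f[0], f[1]) for f in forces)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    jx = noise * temperature * math.cos(angle)
    jy = noise * temperature * math.sin(angle)
    return (pos[0] + step * (fx + jx), pos[1] + step * (fy + jy))

# Example: three forces that partly cancel; a seeded generator keeps runs repeatable.
new_pos = step_2d((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0)],
                  step=0.5, rng=random.Random(0))
```

Under this reading, forces that cancel each other out still produce a perturbation, which may help a point escape a poor position.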
Distances between data as a function of dimension. [Figure: histograms of distances between data for dim = 1, 2, 5, 10, 20, 50 and 200.] Histogram of distances between data randomly generated in spaces of various dimensions. The relative variation of distances between data decreases as the dimension of the space increases.
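The effect can be reproduced with a short Python experiment. The slide does not say how the data were generated, so this sketch assumes independent standard Gaussian coordinates, and it measures the relative variation as the standard deviation of the pairwise distances divided by their mean; the function name `relative_spread` is hypothetical.

```python
import math
import random
import statistics

def relative_spread(dim, n_points=60, seed=0):
    """Std/mean of pairwise Euclidean distances between random points."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n_points) for j in range(i + 1, n_points)]
    return statistics.pstdev(dists) / statistics.mean(dists)

# Relative spread for the dimensions shown on the slide.
spreads = {dim: relative_spread(dim) for dim in (1, 2, 5, 10, 20, 50, 200)}
```

The ratio shrinks as the dimension grows, which is why all distances look alike in high dimension and why weighting short distances matters.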
Favoring representation of short distances. [Figure: histogram of distances for dim = 200 with colored stiffness curves overlaid; vertical axis: stiffness.] Colored curves: the stiffness of the springs is a function of the distance between data • The closer the data, the greater the stiffness • The steepness of the curve sets the neighborhood range. Demartines P. et al. (1997), « Curvilinear Component Analysis: A self-organizing neural network for nonlinear mapping of data sets ». IEEE Transactions on Neural Networks 8(1): 148-154, January 1997.
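A stiffness curve with these properties can be sketched as a decreasing sigmoid of the data-space distance. The slide does not specify the functional form, so the sigmoid and the parameter names `radius` (distance at which stiffness falls to one half) and `steepness` are assumptions made for illustration only.

```python
import math

def stiffness(data_dist, radius=1.0, steepness=5.0):
    """Hypothetical spring stiffness: close to 1 for short data-space
    distances, close to 0 for long ones, with the transition at `radius`."""
    x = steepness * (data_dist - radius)
    x = max(-700.0, min(700.0, x))  # keep math.exp within floating-point range
    return 1.0 / (1.0 + math.exp(x))

# Stiffness for a near neighbor, a point at the radius, and a distant point.
values = [stiffness(d) for d in (0.5, 1.0, 2.0)]
```

A larger `steepness` makes the cutoff sharper, so points just beyond `radius` contribute almost nothing; a smaller value widens the neighborhood that influences the placement.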
Dimension reduction: mapping of a 3-dimensional open box onto a plane
FDP-MDS projection is non-linear. Quality of fit: two 3D boxes, with their open sides pointing in different directions, are projected onto a two-dimensional space.
[Figures: three sets of panels showing the data and their projections, together with plots of the distance between points in the projection (vertical axis) against the distance between points in the original space (horizontal axis).]
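Plots of distance in the projection against distance in the original space can be built from pairs of distances, as in the Python sketch below. The summary figure `normalized_stress` is an assumption added for illustration (a Kruskal-style stress), not a measure named in the slides; both function names are hypothetical.

```python
import math

def distance_pairs(original, projected):
    """Return (original distance, projected distance) for every pair of points."""
    pairs = []
    n = len(original)
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((math.dist(original[i], original[j]),
                          math.dist(projected[i], projected[j])))
    return pairs

def normalized_stress(pairs):
    """sqrt(sum (d_proj - d_orig)^2 / sum d_orig^2); 0 means distances are preserved."""
    num = sum((dp - do) ** 2 for do, dp in pairs)
    den = sum(do ** 2 for do, dp in pairs)
    return math.sqrt(num / den)

# Example: three points lying in the plane z = 0, projected by dropping z.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
projected = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pairs = distance_pairs(original, projected)
```

Plotting the pairs as a scatter gives the kind of diagram shown on these slides: points on the diagonal are distances that the projection preserves, and a mapping that favors short distances should stay close to the diagonal near the origin.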
2-dimensional mapping of cities (non-linear projection). [Figure: 2D map of world cities, each point labelled with its city name, e.g. Wellington, Sydney, Tokyo, Moscow, Paris, New York, Cairo, Buenos Aires; scale bar: 5000 km.]
FDP-MDS robustness: noise. Reference. Randomly located prototypes.
FDP-MDS robustness: local minimum. Reference. A prototype is trapped in a local minimum.
Accounting for neighborhood: distant points can be taken into account to a greater or lesser extent.
Frequencies of oligonucleotides (words): genomic signature. [Figure: signatures for 2-, 3-, 4- and 6-letter words, with word positions labelled (e.g. A, C, G, T, AA, CC, CCC) and a color scale from low frequency to high frequency.]
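Counting word frequencies is simple to sketch in Python. The version below is an illustration under stated assumptions: it reads a single strand, counts overlapping words, and skips any word containing a character other than A, C, G or T; the slides do not say how the authors handle these points, and the function name `word_frequencies` is hypothetical.

```python
from collections import Counter

def word_frequencies(sequence, k):
    """Relative frequencies of overlapping k-letter words in a DNA sequence."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # assumption: ignore ambiguous bases
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

# Example: 2-letter words of a short sequence.
signature = word_frequencies("ACGTAC", 2)
```

For words of length k there are 4^k possible entries, so a 6-letter signature is a vector of 4096 frequencies, which is the kind of high-dimensional data the mapping method is meant to display.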
A. fulgidus, B. subtilis, C. elegans, C. jejuni, T. pallidum, D. radiodurans, E. coli, H. pylori, H. influenzae, Synechocystis sp., H. sapiens, M. jannaschii, P. horikoshii, R. prowazekii, S. cerevisiae
Genomic signatures mapping: Principal Component Analysis. [Figure: axes: 1st component, 2nd component; labelled group: VIRUS.]
Genomic signatures mapping: 2D FDP-MDS mapping of the tree of life. - Each point is associated with a species - Color depends on taxonomy. [Figure: axes: Axis 1, Axis 2; labelled taxonomic groups include Bacteria, Archaea, Eukaryota, Proteobacteria, Firmicutes, Cyanobacteria, Actinobacteria, Chlamydiae, Spirochaetes, Euryarchaeota, Crenarchaeota, Fungi, Metazoa and Viridiplantae, among others.] Principal Component Analysis: PCA poorly displays taxonomic groups.