Visualisation and exploration of high-dimensional data using a "force directed placement" method



slide-1
SLIDE 1

Visualisation and exploration of high-dimensional data using a "force directed placement" method Application to the analysis of genomic signatures. Sylvain Lespinats, Alain Giron, Bernard Fertil

ASMDA 2005, Brest, 17-20 May

Unité INSERM 494, CHU Pitié-Salpétrière 91 bd de l'hôpital, 75634 PARIS (France)

slide-2
SLIDE 2
Topics

  • FDP-MDS mapping method
  • Typical examples of mapping
  • The genomic signatures’ world

slide-3
SLIDE 3

FDP-MDS Goal

  • High-dimensional data to be displayed
  • Distances between data are available (distance matrix)
  • Build a biased data representation in a low-dimensional space (e.g. preference for small distances)
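As a minimal sketch of the only input the method is said to need, the distance matrix, the snippet below builds one from a few made-up points. The Euclidean metric, the function name `distance_matrix` and the sample data are assumptions for illustration; the slides do not say which metric was used here.

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of Euclidean distances between points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Three illustrative points in a 4-dimensional data space (made-up values).
data = [(0.0, 0.0, 0.0, 0.0), (3.0, 4.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0)]
L = distance_matrix(data)
```

Only `L` is needed downstream, so any other metric could replace `math.dist` without changing the rest of a pipeline built on it.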

slide-4
SLIDE 4

FDP-MDS Algorithm

Keywords: Multi-Dimensional Scaling, Curvilinear Component Analysis, Force Directed Placement, Prototypes

  • Step 1 : Selection of prototypes
  • Step 2 : Placement of prototypes
  • Step 3 : Space learning

slide-5
SLIDE 5

Step 1 : selection of prototypes

slide-6
SLIDE 6

Step 1 : Selection of prototypes

Choice of « seeds » : Hastie T. et al. (2001), « The Elements of Statistical Learning: Data Mining, Inference, and Prediction », Springer Series in Statistics.

slide-7
SLIDE 7

Step 1 : Selection of prototypes

slide-8
SLIDE 8

Step 1 : Selection of prototypes

slide-9
SLIDE 9

Step 1 : Selection of prototypes

slide-10
SLIDE 10

Step 1 : Selection of prototypes

slide-11
SLIDE 11

Step 1 : Selection of prototypes

slide-12
SLIDE 12

Step 1 : Selection of prototypes

slide-13
SLIDE 13

Step 1 : Selection of prototypes

slide-14
SLIDE 14

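Step 2 : new prototype handling thanks to Force Directed Placement

Prototypes 1, 2 and 3 each exert a spring force on the new prototype N:

$\mathbf{F}_1 = k_1 (d_{1n} - L_{1n}) \mathbf{u}_{1n}$

$\mathbf{F}_2 = k_2 (d_{2n} - L_{2n}) \mathbf{u}_{2n}$

$\mathbf{F}_3 = k_3 (d_{3n} - L_{3n}) \mathbf{u}_{3n}$

k : spring stiffness
d : distance in output space
L : distance in data space
u : unit vector

$k_i$ depends on $L_{ij}$ in order to favor “some” distances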
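The spring force on this slide can be sketched for a single pair as follows. This is an illustration under assumptions, not the authors' code: the function name `spring_force` is hypothetical, and the unit vector is taken to point from the new prototype N toward prototype i, so that a spring stretched beyond its data-space length pulls N closer.

```python
import math

def spring_force(pos_n, pos_i, target_len, stiffness):
    """Force on new prototype n from prototype i: k * (d - L) * u."""
    delta = [b - a for a, b in zip(pos_n, pos_i)]  # vector from n to i
    d = math.hypot(*delta)                         # distance in output space
    if d == 0.0:
        return [0.0] * len(delta)                  # direction undefined
    unit = [c / d for c in delta]                  # assumed orientation of u
    return [stiffness * (d - target_len) * c for c in unit]

# Output-space distance 5, data-space distance 2: the spring is stretched,
# so the force points from n toward i.
force = spring_force((0.0, 0.0), (3.0, 4.0), target_len=2.0, stiffness=0.5)
```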

slide-15
SLIDE 15

Step 2 : new prototype handling thanks to Force Directed Placement

$\mathbf{F} = \mathbf{F}_1 + \mathbf{F}_2 + \mathbf{F}_3$

  • Algebraic sum of forces: temperature => additional randomly oriented force
  • Vectorial sum of forces: movement
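One possible reading of this slide is sketched below in 2D: the vector sum of the spring forces moves the new prototype, while the sum of the force magnitudes sets a "temperature" that scales an extra force of random orientation. The function name `fdp_step`, the parameters `step` and `temp_gain`, and the proportional link between temperature and force magnitudes are assumptions, not details given in the slides.

```python
import math
import random

def fdp_step(pos_n, placed, targets, stiffness, step=0.1, temp_gain=0.05, rng=random):
    """Move new prototype n once; return (new_position, temperature)."""
    total = [0.0, 0.0]     # vectorial sum of forces
    magnitude_sum = 0.0    # algebraic sum of force magnitudes
    for pos_i, L_in, k_i in zip(placed, targets, stiffness):
        dx, dy = pos_i[0] - pos_n[0], pos_i[1] - pos_n[1]
        d = math.hypot(dx, dy)
        if d == 0.0:
            continue       # direction undefined, skip this spring
        f = k_i * (d - L_in)
        total[0] += f * dx / d
        total[1] += f * dy / d
        magnitude_sum += abs(f)
    temperature = temp_gain * magnitude_sum    # assumed link to the forces
    angle = rng.uniform(0.0, 2.0 * math.pi)    # random orientation
    noise = (temperature * math.cos(angle), temperature * math.sin(angle))
    new_pos = (pos_n[0] + step * (total[0] + noise[0]),
               pos_n[1] + step * (total[1] + noise[1]))
    return new_pos, temperature

placed = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]  # already placed prototypes
targets = [5.0, 3.0, 4.0]                      # data-space distances L_in
stiffness = [1.0, 1.0, 1.0]                    # spring stiffnesses k_i
rng = random.Random(0)
position = (1.0, 1.0)
for _ in range(200):
    position, temperature = fdp_step(position, placed, targets, stiffness, rng=rng)
```

With this choice the random force vanishes as the springs relax, so a prototype sitting exactly at its target distances no longer moves.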

slide-16
SLIDE 16

Distances between data as a function of dimension

[Figure: histogram of distances between data randomly generated in spaces with various dimensions; one curve each for dim = 1, 2, 5, 10, 20, 50 and 200]

The relative variation of distances between data decreases as the dimension of the space increases.
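The effect described on this slide can be reproduced with a small simulation. The sketch assumes points drawn uniformly in the unit hypercube and measures relative variation as standard deviation divided by mean; the slides do not state the distribution or the statistic, so both are illustrative choices, as is the name `relative_variation`.

```python
import math
import random
import statistics

def relative_variation(dim, n_points=100, seed=0):
    """Std/mean of pairwise Euclidean distances for uniform random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n_points) for j in range(i + 1, n_points)]
    return statistics.pstdev(dists) / statistics.mean(dists)

# Same dimensions as in the figure legend.
variation = {dim: relative_variation(dim) for dim in (1, 2, 5, 10, 20, 50, 200)}
for dim, value in variation.items():
    print(f"dim = {dim:3d}  relative variation = {value:.3f}")
```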

slide-17
SLIDE 17

Favoring representation of short distances

[Figure: histogram of distances for dim = 200, with colored stiffness curves]

Colored curves: the stiffness of the springs is a function of the distance between data

  • The closer the data, the greater the stiffness
  • The steepness of the curve sets the neighborhood range

Demartines P. et al. (1997), « Curvilinear Component Analysis: A self-organizing neural network for nonlinear mapping of data sets », IEEE Transactions on Neural Networks 8(1): 148-154, January 1997.
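The slides do not give the formula behind the stiffness curves. As a purely illustrative stand-in, a decreasing logistic function has the two properties listed above: stiffness grows as data get closer, and a steepness parameter controls how sharply the neighborhood is cut off. The function name and the parameters `radius` and `steepness` are hypothetical.

```python
import math

def stiffness(L, radius=12.0, steepness=1.0):
    """Decreasing logistic weight: near 1 for L << radius, near 0 for L >> radius."""
    return 1.0 / (1.0 + math.exp(steepness * (L - radius)))

# Stiffness for a few data-space distances, with a gentle and a steep curve.
for L in (5.0, 10.0, 12.0, 14.0, 20.0):
    print(L, round(stiffness(L, steepness=0.5), 3), round(stiffness(L, steepness=5.0), 3))
```

A very steep curve approaches a hard cutoff at `radius`, whereas a gentle one lets distant pairs keep some influence.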

slide-18
SLIDE 18

Dimension reduction

Mapping of a 3-dimensional open box onto a plane

slide-19
SLIDE 19
slide-20
SLIDE 20

FDP-MDS projection is non-linear

Two boxes (3D) with their open sides pointing in different directions are projected into a two-dimensional space.

Quality of fit

slide-21
SLIDE 21
[Figure: mappings and distance plots; axes: distance between points in the original space, distance between points on the projection]
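Plots that set the distance on the projection against the distance in the original space can be prepared from any pair of configurations. The sketch below collects those pairs and computes one simple summary score. The helper names, the score, and the toy "projection" (dropping the z coordinate of a cube) are assumptions for illustration and are not the FDP-MDS mapping itself.

```python
import math

def distance_pairs(original, projected):
    """Return (L_ij, d_ij) for every pair: data-space vs. output-space distance."""
    n = len(original)
    return [(math.dist(original[i], original[j]),
             math.dist(projected[i], projected[j]))
            for i in range(n) for j in range(i + 1, n)]

def mean_relative_error(pairs):
    """Average of |d - L| / L over pairs with L > 0 (an illustrative score)."""
    errs = [abs(d - L) / L for L, d in pairs if L > 0.0]
    return sum(errs) / len(errs) if errs else 0.0

# Illustration only: corners of a unit cube, flattened by dropping z.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
flat = [(x, y) for x, y, z in cube]
pairs = distance_pairs(cube, flat)
score = mean_relative_error(pairs)
```

Points lying on the diagonal of such a plot correspond to distances that the mapping preserves exactly.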

slide-22
SLIDE 22
[Figure: mappings and distance plot; axes: distance between points in the original space, distance between points on the projection]

slide-23
SLIDE 23
[Figure: mappings and distance plot; axes: distance between points in the original space, distance between points on the projection]

slide-24
SLIDE 24
slide-25
SLIDE 25

2-dimensional mapping of cities

5000 km

Honolulu Calgary San Diego Winnipeg Denver Cheyenne Anchorage Jasper Vancouver Seattle Portland Idaho Falls Salt Lake City Las Vegas San Francisco Los Angeles Flagstaff Phoenix Churchill Guadalajara Fort De France Guatemala City Monterrey Charleston Richmond Cleveland Charlottesville Washington Baltimore Atlantic City Boston Albany New York Philadelphia Toronto Ottawa Quebec Montreal Portland Detroit Charlotte Kindley Halifax Richmond Lexington Indianapolis Albany Huntsville Chicago Springfield Columbia Minneapolis Des Moines Kansas City Dodge City Fort Smith
St. Louis
Santa Fe Albuquerque Tucson Memphis New Orleans Houston Dallas Austin Orlando Miami Nassau Havana Merida Belize Vera Cruz Mexico City San Salvador Tegucigalpa Managua Panama City Guantanamo Bay Port Au Prince Santo Domingo San Juan Baranquilla Maracaibo Caracas Medellin Belo Horizonte Recife La Paz Salvador Lima Port of Spain Bogota Cali Georgetown Paramaribo Cayenne Belem Fortaleza Brasilia Sao Paulo Rio De Janeiro Curitiba Porto Alegre Montevideo Buenos Aires Ascuncion Tucuman Cordoba Santiago Valparaiso Guayaquil Quito Punta Arenas Narsarssuaq London Brussels Amsterdam Zurich Strasbourg Mannheim Hannover Hamburg Belgrade Prague Vienna Krakow Copenhagen Birmingham Cardiff Dublin Belfast Edinburgh Glasgow Paris Milan Nantes Lyon Marseilles Nice Munich Narvik Berlin Rome Naples Warsaw Kaliningrad Oslo Stockholm Budapest Minsk Leningrad Helsinki Reykjavik Tunis Barcelona Shannon Archangel Lajes Valencia Madrid Lisbon Gibraltar Casablanca Dakar Accra Monrovia Abidjan Lagos Benghazi Cape Town Brazzaville Nairobi Kinshasa Kisangani Asmara Mogadiscio Khartoum Addis Ababa Cairo Aden Jedda Johannesburg Pretoria Tananarive Dar es Salaam Jerusalem Mosul Odessa Amman Baghdad Tel Aviv Algiers Ankara Adana Thessaloniki Athens Sofia Istanbul Bucharest Izmir Damascus Beirut Alma Ata Sverdlovsk Tbilisi Moscow Kuibyshev Kiev Kharkov Volgograd Rostov on Don Krasnoyarsk Tashkent Abadan Meshed Dhahran Tehran Riyadh Karachi Colombo Madras Lahore New Delhi Peshwar Bombay Katmandu Ahmenabad Calcutta Kabul Bangalore Nagpur Vladivostok Shanghai Sapporo Petropavlovsk Tokyo Seoul Fukuoka Pyongyang Da Nang Hanoi Hong Kong Chongquing Manila Taipei Tainan Penang Medan Chittagong Rangoon Bangkok Mandalay Phnom Penh Palembang Djakarta Singapore Makassar Kupang Surabaya Kuala Lumpur Brisbane Melbourne Alice Springs Manokwari Port Moresby Adelaide Sydney Darwin Perth Christchurch Auckland Wellington Hanoï

Non-linear projection

slide-26
SLIDE 26

FDP-MDS Robustness: noise

randomly located Prototypes

reference

slide-27
SLIDE 27

Reference

A prototype is trapped in a local minimum

FDP-MDS Robustness: local minimum

slide-28
SLIDE 28

Accounting for neighborhood

Distant points can be taken into account to a greater or lesser extent

slide-29
SLIDE 29

Frequencies of oligonucleotides (words): genomic signature

[Figure: signatures for 2-, 3-, 4- and 6-letter words over the letters A, T, G, C; example words CC, GC, CG, AC, CA, AA, TT, TC and CCC, ACC, CAC, AAC, CCA; legend: high frequency, low frequency]
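A genomic signature, taken here simply as the table of word frequencies, can be sketched as follows. The function name, the toy sequence and the counting convention (overlapping words on a single strand, words with other letters ignored) are assumptions; the slides do not specify these details.

```python
from collections import Counter
from itertools import product

def genomic_signature(sequence, word_length):
    """Return the frequency of every word of `word_length` letters over A, C, G, T."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + word_length]
                     for i in range(len(seq) - word_length + 1))
    words = ["".join(p) for p in product("ACGT", repeat=word_length)]
    total = sum(counts[w] for w in words)  # ignores words with other letters
    return {w: (counts[w] / total if total else 0.0) for w in words}

# Toy sequence (made up): 16 possible 2-letter words, 14 overlapping occurrences.
signature = genomic_signature("ATGCGCGATATTAGC", 2)
```

For words of n letters the signature is a vector with 4^n components, which is how a genome becomes a point in a high-dimensional space.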

slide-30
SLIDE 30

A. fulgidus, B. subtilis, C. jejuni, D. radiodurans, E. coli, H. influenzae, H. pylori, H. sapiens, C. elegans, M. jannaschii, P. horikoshii, R. prowazekii, Synechocystis sp., T. pallidum, S. cerevisiae

slide-31
SLIDE 31

Principal Component Analysis

Axes: 1st component, 2nd component

genomic signatures mapping

VIRUS

slide-32
SLIDE 32

genomic signatures mapping

Eukaryota

Alveolata Fungi

Archaea

Crenarchaeota

Thermoprotei

Euryarchaeota

Actinobacteria; Actinobacteridae; Actinomycetales Deinococcus-Thermus; Deinococci

Bacteria

Fusobacteria;Fusobacterales; Fusobacteriaceae; Thermotogae;Thermotogales; Thermotogaceae Spirochaetes; Spirochaetales; Chlamydiae; Chlamydiales Cyanobacteria Nostocales Oscillatoriales Chroococcales

Firmicutes

Bacillales Lactobacillales Mollicutes Clostridia Clostridiales

Bacteroidetes Proteobacteria

Gammaproteobacteria Betaproteobacteria Deltaproteobacteria Alphaproteobacteria

Metazoa

Arthropoda Mollusca Nematoda;Chromadorea Chordata Craniata;Vertebrata; Euteleostomi;

stramenopiles

Parabasalidea;Trichomonadida; Trichomonadidae;Tritrichomonadinae

Rhodophyta Viridiplantae

Chlorophyta Streptophyta

Euglenozoa

Microsporidia Zygomycota;Zygomycetes; Mucorales;Mucoraceae Ascomycota Basidiomycota

2D FDP-MDS mapping of the tree of life

  • Each point is associated with a species
  • Color depends on taxonomy

PCA poorly displays taxonomic groups

Principal Component Analysis

Axis 2 Axis 1

slide-33
SLIDE 33

FDP-MDS mapping of signatures

Taxonomic groups are better segmented with FDP-MDS

Arrangement of the genomic signatures space

2D FDP-MDS mapping of the tree of life

  • Each point is associated with a species
  • Color depends on taxonomy
slide-34
SLIDE 34

Panels: PNV and PCA, each showing 400 DNAs and 400 RNAs

[Figure axes: distance between points in the original space, distance between points on the projection]

slide-35
SLIDE 35

Genomic sequence

Words: CCCC, AAAA, TTTT, GGGG, AATT, TTAA

[Figure axis: position along the genome]

Most of the “local signatures” are similar to the genomic signature

atypical

Clostridium acetobutylicum

style of genome
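The comparison between local and genomic signatures can be sketched with a sliding window: each window gets its own word frequencies, which are compared with those of the whole sequence. The window size, the step, the Euclidean comparison and the synthetic sequence are assumptions for illustration only.

```python
import math
from collections import Counter

def word_frequencies(seq, k):
    """Frequencies of overlapping k-letter words in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def local_signature_distances(genome, k, window, step):
    """(start, distance) between each window's signature and the whole-genome one."""
    ref = word_frequencies(genome, k)
    out = []
    for start in range(0, len(genome) - window + 1, step):
        loc = word_frequencies(genome[start:start + window], k)
        words = set(ref) | set(loc)
        dist = math.sqrt(sum((loc.get(w, 0.0) - ref.get(w, 0.0)) ** 2 for w in words))
        out.append((start, dist))
    return out

# Synthetic sequence with one atypical stretch in the middle (made up).
genome = "ACGT" * 50 + "AAAA" * 10 + "ACGT" * 50
distances = local_signature_distances(genome, k=2, window=40, step=40)
atypical_start = max(distances, key=lambda item: item[1])[0]
```

Windows whose distance stands out from the rest are candidates for the "atypical" regions mentioned on the slide.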

slide-36
SLIDE 36

Local versus genomic signatures

PCA

Local & Genomic signatures

  • B. subtilis
slide-37
SLIDE 37

Local versus genomic signatures

FDP-MDS Local & Genomic signatures

  • B. subtilis
slide-38
SLIDE 38

Local versus genomic signatures

FDP-MDS local signatures

  • B. subtilis
slide-39
SLIDE 39

Conclusion

  • FDP-MDS is very efficient for “midsize” problems: data sample size below a few thousand
  • Easy management of high-dimensional data
  • Any metric (or pseudo-metric) can be used
  • May be extended to supervised learning