A problem - too many features
1. A problem - too many features
TDA 231 - Dimension Reduction: PCA and ICA
Devdatt Dubhashi (dubhashi@chalmers.se)
Department of Computer Science and Engineering, Chalmers University
March 3, 2017

◮ Aim: to build a classifier that can diagnose leukaemia using gene expression data.
◮ Data: 27 healthy samples, 11 leukaemia samples (N = 38). Each sample is the expression (activity) level for 3751 genes. (Also have an independent test set.)
◮ In general, the number of parameters will increase with the number of features - here D = 3751.
◮ e.g. Logistic regression - w would have length 3751!
◮ Fitting lots of parameters is hard - imagine Metropolis-Hastings in 3751 dimensions rather than 2!

Features
◮ For visualisation, most examples we have seen had only 2 features, x = [x1, x2]^T.
◮ We sometimes created more: x = [1, x1, x1^2, x1^3, ...]^T.
◮ Now we have been given lots (3751) to start with, and we need to reduce this number.
◮ 2 general schemes:
  ◮ Use a subset of the originals.
  ◮ Make new ones by combining the originals.

Making new features
◮ An alternative to choosing features is making new ones.
◮ Cluster: cluster the features (turn our clustering problem around). If we use, say, K-means, our new features will be the K mean vectors (a sketch of this follows below).
◮ Projection/combination: reduce the number of features by projecting into a lower-dimensional space. Do this by making new features that are (linear) combinations of the old ones.
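The K-means feature-construction idea above can be sketched in a few lines of Matlab, the tool used elsewhere in the lecture. This is only an illustration, not course code: the stand-in data matrix, the choice K = 10, and the use of kmeans (from the Statistics Toolbox) are all assumptions made for the example.

    % Making new features by clustering the original ones (illustrative sketch).
    % Each of the K new features is the mean of one cluster of original features.
    X = randn(38, 200);            % stand-in data, N x D (the real data has D = 3751)
    K = 10;                        % number of new features (chosen arbitrarily here)
    idx = kmeans(X', K);           % cluster the D columns (features), not the N rows
    Xnew = zeros(size(X, 1), K);
    for k = 1:K
        Xnew(:, k) = mean(X(:, idx == k), 2);   % new feature k: mean of its cluster
    end

The classifier would then be trained on Xnew (N x K) instead of the full N x D matrix.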

2. Projection
◮ We can project data (D dimensions) into a lower number of dimensions (M):
  Z = XW
  ◮ X is N x D
  ◮ W is D x M
  ◮ Z is N x M - an M-dimensional representation of our N objects.
◮ W defines the projection.
◮ Analogy: a 3-dimensional object (a hand) and its 2-dimensional projection (its shadow). Changing W is like changing where the light is coming from, or rotating the hand; X is the hand, Z is the shadow.
◮ Once we have chosen W we can project test data into this new space too: Z_new = X_new W.

Choosing W
◮ Different W will give us different projections (imagine moving the light).
◮ Which should we use? Not all will represent our data well - some projections don't look like a hand!

Principal Components Analysis
◮ Principal Components Analysis (PCA) is a method for choosing W.
◮ It finds the columns of W one at a time (define the m-th column as w_m). Each D x 1 column defines one new dimension.
◮ Consider one of the new dimensions (columns of Z): z_m = X w_m.
◮ PCA chooses w_m to maximise the variance of z_m (see the sketch below):
  σ²_m = (1/N) Σ_{n=1}^{N} (z_mn - µ_m)²,  where  µ_m = (1/N) Σ_{n=1}^{N} z_mn
◮ Once the first one has been found, w_2 is chosen to maximise the variance while being orthogonal to the first, and so on.
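A minimal Matlab sketch of the projection Z = XW and of the variance PCA tries to maximise. The sizes N, D, M and the random matrices are made up for illustration; they are not the lecture's data.

    % Projecting N objects from D dimensions down to M dimensions, Z = X*W,
    % then measuring the variance of one projected dimension (illustrative sketch).
    N = 38; D = 5; M = 2;          % small D here; the gene data has D = 3751
    X = randn(N, D);               % stand-in data matrix
    W = randn(D, M);               % an arbitrary projection (columns w_1, ..., w_M)
    Z = X * W;                     % N x M representation of the N objects
    z1 = Z(:, 1);                  % one new dimension, z_m = X * w_m
    mu1 = mean(z1);
    var_z1 = mean((z1 - mu1).^2);  % the variance that PCA maximises over w_m
    Xtest = randn(5, D);
    Ztest = Xtest * W;             % test data projected with the same W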

3. PCA - a visualisation
◮ Original data in 2 dimensions (x1, x2); we would like a 1-dimensional projection.
◮ Pick some arbitrary w.
◮ Project the data onto it.
◮ Compute the variance (on the line).
◮ The position on the line is our 1-dimensional representation.
◮ Different choices of w give different variances: the directions shown in the plots give σ²_z = 0.39, σ²_z = 1.2 and σ²_z = 1.9 (a sketch of this comparison follows below).
[Plots: the 2-D data cloud with three candidate projection directions and their variances.]
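The procedure on this slide is easy to try directly. The synthetic 2-D cloud and the particular angles below are arbitrary choices, not the data behind the slide's plots.

    % Trying several arbitrary directions w on a synthetic 2-D cloud and comparing
    % the variance along each line (sketch; not the data from the slides).
    X = randn(200, 2) * [2 0.8; 0.8 1];       % correlated 2-D data
    for theta = [0, pi/6, pi/3, pi/2]         % a few arbitrary angles
        w = [cos(theta); sin(theta)];         % unit-length projection direction
        z = X * w;                            % 1-D representation: position on the line
        fprintf('theta = %.2f  variance = %.2f\n', theta, var(z, 1));
    end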

4. PCA - analytic solution
◮ We could search for w_1, ..., w_M directly.
◮ But an analytic solution is available: the w are the eigenvectors of the covariance matrix of X (sketched below).
◮ Matlab: princomp(x)
◮ [Plot: the data with the projection direction giving σ²_z = 1.9.] What would be the second component?

PCA - leukaemia data
◮ [Plots: (left) the first two principal components z_1, z_2 of the leukaemia data, points labelled by class; (right) test error as more and more components M are used.]
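A sketch of the analytic solution on a stand-in data matrix; in practice princomp(x) (renamed pca in newer Matlab releases) returns the same directions in one call.

    % Analytic PCA: the projection directions are eigenvectors of the covariance
    % of X (illustrative sketch, not the course code).
    X = randn(38, 20);                            % stand-in data, N x D
    Xc = X - repmat(mean(X, 1), size(X, 1), 1);   % centre each feature
    C = cov(Xc);                                  % D x D sample covariance matrix
    [V, E] = eig(C);                              % columns of V are eigenvectors
    [~, order] = sort(diag(E), 'descend');        % sort by eigenvalue (variance)
    M = 2;
    W = V(:, order(1:M));                         % first M principal directions
    Z = Xc * W;                                   % the first M principal components

Projecting held-out data with the same W (after centring it with the training means) gives the representation used for a test-error curve like the one above.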

5. Summary (Part 1)
◮ Sometimes we have too much data (too many dimensions).
◮ Features can be dimensions that already exist.
◮ Or we can make new ones.

Part 2: ICA (the cocktail party problem)

The cocktail party problem
◮ Several speakers talk at once; microphones 1-4 are placed around the room.
◮ Each microphone will record a combination of all speakers.
◮ Can we separate them back out again? (A simulation of this mixing is sketched below.)

Demo
◮ Online: http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi
◮ Matlab: available on the course webpage. To run:
  ◮ load ica_demo.mat
  ◮ ica_image
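Before separating sources it helps to see how the mixed recordings arise. The Matlab sketch below generates four toy sources and mixes them as described above, anticipating the X = AS + E model of the next slide. The signals, mixing matrix and noise level are invented for illustration and are unrelated to the course's ica_demo.mat.

    % Simulating the mixing that the demo undoes: N sources combined into N
    % recordings by an unknown mixing matrix, plus noise.
    N = 4; D = 1000;                  % 4 speakers / 4 microphones, D time samples
    t = 1:D;
    S = [sin(0.07 * t);               % source 1: sine wave
         sign(sin(0.23 * t));         % source 2: square-ish wave
         2 * rand(1, D) - 1;          % source 3: uniform noise
         randn(1, D)];                % source 4: Gaussian noise
    A = randn(N, N);                  % mixing matrix (unknown in practice)
    E = 0.05 * randn(N, D);           % sensor noise, e_nd ~ N(0, sigma^2)
    X = A * S + E;                    % each row of X is one microphone recording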

6. Independent Components Analysis - how it works
◮ Corrupted data (images/sounds) is a vector of D numbers, i.e. the n-th image is x_n.
◮ We have N images - stack them up into an N x D matrix X.
◮ Assume that this is the result of the following corrupting process:
  X = AS + E
◮ A is the mixing matrix, E is noise (S is N x D), with e_nd ~ N(0, σ²).

Inference
◮ From Bayes' rule (look back...):
  p(S | X, A, σ²) ∝ p(X | S, A, σ²) p(S)
◮ In our demo, we found the values of S, A and σ² that maximised the log posterior - the MAP solution.
◮ There is some further reading on the webpage if you want to know more.

Aside - ICA and the central limit theorem
◮ Central limit theorem (paraphrased): if we keep adding the outcomes of independent random variables together, we eventually get something that looks Gaussian.
◮ Example: roll a die m times and take the average (repeat this lots of times to get a histogram). From left to right, the histograms show m = 1, m = 2 and m = 5 - looking more Gaussian as m increases. (A sketch reproducing this is given below.)
◮ Sometimes ICA is performed by reversing this theorem: in X = AS + E, X is some random variables added together, so it will be more 'Gaussian' than S. Find S that is as non-Gaussian as possible.
◮ More resources:
  ◮ http://www.cis.hut.fi/projects/ica/icademo/
  ◮ http://www.cis.hut.fi/projects/ica/
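The die-rolling aside is straightforward to reproduce; the number of trials and histogram bins below are arbitrary choices.

    % Central limit theorem demo: averages of m die rolls look more Gaussian as m
    % grows (sketch reproducing the m = 1, 2, 5 histograms from the slide).
    trials = 10000;
    for m = [1 2 5]
        rolls = randi(6, trials, m);   % trials independent sets of m die rolls
        avg = mean(rolls, 2);          % average of each set
        figure; hist(avg, 30);         % histogram of the averages
        title(sprintf('m = %d', m));
    end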

7. Summary
◮ PCA and ICA are both examples of projection techniques.
◮ Both assume a linear transformation:
  ◮ ICA: X = AS + E
  ◮ PCA: Z = XW
◮ PCA can be used for data pre-processing or visualisation.
◮ ICA can be used to separate sources that have been mixed together.
◮ We also looked at PCA as a feature selection method.
