Methodology 2 (Machine Learning 2018, Peter Bloem)


SLIDE 1

Methodology 2

Machine Learning 2018 Peter Bloem


machine learning: the basic recipe

1. Abstract (part of) your problem to a standard task.
   Classification, Regression, Clustering, Density estimation, Generative Modeling, Online learning, Reinforcement Learning, Structured Output Learning.
2. Choose your instances and their features.
   For supervised learning, choose a target.
3. Choose your model class.
   Linear models, Decision Trees, kNN, …
4. Search for a good model.
   Usually, a model comes with its own search method. Sometimes multiple options are available.


Today we will be talking about what happens before the basic machine learning recipe. How do we get features from our data, and how do we clean up our data so that a machine learning algorithm can consume it?

methodology

part 1:
  • Cleaning your data
  • Choosing features
part 2:
  • Normalisation
  • Principal Component Analysis
  • Eigenfaces

22.Methodology2.key - 20 March 2018

SLIDE 2

cleaning your data

  • Missing data
  • Outliers

income   status     unemployed
32000    married    true
?        single     false
89000    ?          true
34000    divorced   false
54000    married    true
?        ?          false
21000    ?          true
25000    single     true

simple solutions

  • Remove the feature
  • Remove the instances
    • are the data missing uniformly?

The simplest way to get rid of missing data is to just remove the feature(s) for which there are values missing. If you’re lucky, the feature is not important anyway. You can also remove the instances (i.e. the rows) with missing data. Here you have to be careful. If the data was not corrupted uniformly, removing rows with missing values will change your data distribution. You might have data gathered by volunteers. If only one volunteer had a hardware problem, then only his data will contain missing values. Another reason for unequally distributed missing data is if people refuse to answer certain questions. For instance, if only rich people refuse to answer questions, removing these instances will remove lots of rich people from your data and give you a different distribution.
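Both simple solutions are one-liners in, for example, pandas. A minimal sketch on a hypothetical three-row frame (not the slide's data):

```python
import pandas as pd

# Toy frame with missing values (hypothetical, for illustration)
df = pd.DataFrame({
    "income": [32000, None, 89000],
    "status": ["married", "single", None],
})

drop_rows = df.dropna()        # remove the instances (rows) with missing values
drop_cols = df.dropna(axis=1)  # remove the features (columns) with missing values

print(drop_rows.shape)  # (1, 2): only the first row survives
print(drop_cols.shape)  # (3, 0): every column has a missing value
```

Note how aggressive both options are on this tiny example: with missing values spread across rows and columns, either choice throws away a lot of data.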


SLIDE 3

Think about the REAL-WORLD use case.


Whenever you have questions about how to approach something like this, it's best to think about the real-world setting where you might apply your trained model. Can you expect missing data there too, or will that data be clean already? Examples of production systems that should expect missing data are situations where data comes from a form with optional values, or situations where data is merged from different sources (online forms and phone surveys).

will you get missing values in production?

YES: Keep them in the test set, and make a model that can consume them.

NO: Endeavour to get a test set without missing values, and test different methods for completing the data.


If you can reasonably assume that the values are missing uniformly, then you can just sample instances without missing values for your test set. Otherwise, you'll have to model the process that corrupted your data (which is outside the scope of this lecture). How can you tell? There's no surefire way, but usually you can get a good idea by plotting a histogram of how much data is missing against some other feature. For instance, if the histogram of income for instances with missing features is very different from the regular histogram over income, you can assume that your data was not corrupted uniformly.

guess the missing values (imputation)

  • categorical data: use the mode
  • numerical data: use the mean
  • make the feature a target value and train a model (kNN, linear regression, etc.)
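The mode and mean rules can be sketched in a few lines of pandas, on a hypothetical frame mirroring the earlier income/status table:

```python
import pandas as pd

# Toy data with missing values (hypothetical, for illustration)
df = pd.DataFrame({
    "income": [32000, None, 89000, 34000, 54000, None, 21000, 25000],
    "status": ["married", "single", None, "divorced", "married", None, None, "single"],
})

# Numerical feature: impute with the mean of the observed values
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical feature: impute with the mode (most frequent observed value)
df["status"] = df["status"].fillna(df["status"].mode()[0])

print(df.isna().sum().sum())  # 0: no missing values left
```

The model-based variant (third bullet) would instead train, say, a kNN regressor with income as the target and the other columns as features, then predict the missing entries.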


SLIDE 4

Outliers

Outliers come in different shapes and sizes. Here, the six dots to the right are so oddly, mechanically aligned that we are probably looking at some measurement error (perhaps someone using the value -1 for missing data). We can remove these, or interpret them as missing data, and use the approaches just discussed.

[figure: income distribution]

Here, however, the “outlier” is very much part of the distribution. If we fit a normal distribution to this data, the outlier would ruin our fit, but that’s because the data isn’t normally distributed. Here we should leave the outlier in, and adapt our model.


If our instances are images of faces, the image on the left is an extreme of our data distribution. It looks odd, but it can be very helpful in fitting a model. The other image is clearly corrupted data, which we may want to clean. However, remember the real-world use case.


SLIDE 5

Are they mistakes?

  • Yes: deal with them.
  • No: leave them be. Check your model for strong assumptions of normality.

Can we expect them in production?

  • Yes: Make sure the model can deal with them.
  • No: Remove. Get a test set that represents the production situation.

If you have very extreme values that are not mistakes (like Bill Gates earlier), your data is probably not normally distributed. If you use a model which assumes normally distributed data (like linear regression), it will be very sensitive to these kinds of “outliers”. It may be a good idea to remove this assumption from your model (or replace it with an assumption of a heavy-tailed distribution). See also figure 7.2 in the book.

models that can deal with outliers

  • Beware of squared errors (MSE).
  • Model noise with a heavy-tailed distribution.
  • The proof is in the pudding: the performance on the test/validation set will be the deciding factor.
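Why beware of squared errors? A quick illustrative sketch (toy numbers, not from the slides) showing how a single outlier dominates the mean squared error far more than the mean absolute error:

```python
import numpy as np

# Five residuals; the last one is an outlier
residuals = np.array([0.1, -0.2, 0.15, -0.1, 50.0])

mse = np.mean(residuals ** 2)     # 500.0165 -- almost entirely due to the outlier
mae = np.mean(np.abs(residuals))  # 10.11

# Without the outlier, the MSE would drop by a factor of ~24000,
# the MAE only by a factor of ~74.
print(mse, mae)
```

A model trained to minimise squared error will therefore bend a long way to accommodate one extreme point, which is exactly the sensitivity the slide warns about.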

getting features

phone nr    income  status    unemployed  birthdate
0646785910  32000   married   true        4-5-78
0207859461  45000   single    false       3-6-00
0218945958  89000   married   true        4-7-91
0645789384  34000   divorced  false       3-11-94
0652438904  54000   married   true        21-3-95
0309897969  36000   single    false       4-12-46
0159874645  21000   single    true        13-8-52
0256789456  25000   single    true        16-8-79

Even if your data comes in a table, that doesn’t necessarily mean that every column can be used as a feature right away (or that this would be a good approach).


SLIDE 6

from: date, phone number, images, status, text, category, tags, etc.
to: numeric, categoric, or both.

Some algorithms (like linear models or kNN) work only on numeric features. Some work only on categorical features, and some can accept a mix of both (like decision trees). Translating your raw data into features is more an art than a science, and the ultimate test is the test set performance. But let’s look at a few examples, to get a general sense of the way of thinking.

age

to numeric: from integer to real-valued. Not usually an issue.
to categoric: bin the data? Above or below the median?

  • Information loss is unavoidable.

Age is integer-valued, while numeric features are usually real-valued. In this case, the transformation is fine, and we can just interpret the age as a real-valued number. To transform a numeric feature to categoric values we’ll have to bin the data. We’ll lose information this way, which is unavoidable, but if you have a classifier that only consumes categorical features, and works really well on your data, it may be worth it.
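The above/below-the-median binning can be sketched in a couple of lines (the ages here are made up for illustration):

```python
import numpy as np

# Binning the numeric feature "age" into a two-valued categorical feature
ages = np.array([23, 35, 47, 52, 61, 19, 44])

median = np.median(ages)  # 44.0
binned = np.where(ages > median, "above", "below")
print(binned.tolist())
```

More bins (quartiles, deciles) keep more information, at the cost of sparser categories.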

phone number

0235678943

to numeric: from integer (?) to real-valued. Highly problematic.
to categoric: area codes, cell phone vs. landline

We can represent phone numbers as integers too, so you might think the translation to numeric is fine. But here it makes no sense at all. Translating to a real-valued feature would impose an ordering on the phone numbers that would be totally meaningless. My phone number may represent a higher number than yours, but that has no bearing on any possible target value. What is potentially useful information is the area code. This tells us where a person lives, which gives an indication of their age, their political leanings, their income, etc. Whether or not the phone number is for a mobile or a landline may also be useful. But these are categorical features.


SLIDE 7

categoric to numeric

integer coding, and one-hot coding (aka 1-of-N coding):

genre      integer   sci-fi  romance  comedy  thriller
sci-fi     1         1       0        0       0
romance    2         0       1        0       0
comedy     4         0       0        1       0
thriller   3         0       0        0       1
thriller   3         0       0        0       1
romance    2         0       1        0       0
romance    2         0       1        0       0
sci-fi     1         1       0        0       0
thriller   3         0       0        0       1
comedy     4         0       0        1       0

So what if our model only accepts numerical features? How do we feed it categorical data? Here are two approaches. Integer coding gives us the same problem we had earlier. We are imposing a false ordering on unordered data. One-hot coding avoids this issue, by turning one categorical feature into several numeric features. Per genre we can say whether it applies to the instance or not.
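One-hot coding is a one-liner in, for example, pandas (a sketch using the genre column from the slide):

```python
import pandas as pd

# The genre column from the slide
genres = pd.Series(
    ["sci-fi", "romance", "comedy", "thriller", "thriller",
     "romance", "romance", "sci-fi", "thriller", "comedy"],
    name="genre",
)

# One numeric 0/1 column per genre: no false ordering is imposed
one_hot = pd.get_dummies(genres)
print(one_hot.columns.tolist())  # ['comedy', 'romance', 'sci-fi', 'thriller']
```

Each row has exactly one 1, marking the genre that applies to that instance.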

expanding features


How to get the useful information from your data into your classifier depends entirely on what your classifier can handle. The linear classifier is a good example. It’s quite limited in what kinds of relations it can represent. Essentially, each feature can only influence the classification boundary in a simple way. It can push it up or down, but it can’t let its influence depend on the values of the other features. Here is a (slightly contrived) example of when that might be necessary. Imagine classifying spam emails on two features: to what extent the email mentions drugs, and to what extent the email is sent to a pharmaceutical company. This problem is completely impossible for a linear classifier.

cross product

 d      p      d·p
 0.75   0.98   0.74
-0.66  -0.32   0.21
-0.45   0.84  -0.38
 0.93   0.78   0.72
-0.42   0.24  -0.10
-0.02   0.43  -0.01
-0.74   0.58  -0.43
-0.41  -0.41   0.17
 0.59   0.72   0.42

We can switch to a more powerful model, but we can also add power to the linear classifier by adding extra features. Here we’ve added the cross product of d and p. (Note the XOR relationship of the signs: two negatives or two positives both make a positive, a negative and a positive make a negative.) Now the classifier can separate the classes perfectly by just looking at whether the third column is positive or negative.
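A quick numerical check of this idea, on synthetic XOR-of-signs data (a sketch, not the slide's table):

```python
import numpy as np

# Synthetic data: the class is the XOR of the signs of d and p,
# which is not linearly separable in (d, p) alone.
rng = np.random.default_rng(0)
d, p = rng.uniform(-1, 1, size=(2, 200))
y = ((d > 0) != (p > 0)).astype(int)  # 1 when the signs differ

# With the derived feature d*p, a linear threshold at 0 is enough:
# the product is negative exactly when the signs differ.
y_hat = (d * p < 0).astype(int)
print((y_hat == y).mean())  # 1.0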

22.Methodology2.key - 20 March 2018

SLIDE 8


This is a linear classifier that operates in a 3D space. But since the third dimension is derived from the other two, we can colour our original space with the classifications. Projected down to 2D, the classifier solves our XOR problem perfectly.

You can try this yourself at playground.tensorflow.org.

regression


We can do the same thing with regression. Here, we have a very non-linear relation.


A purely linear model does a terrible job.


SLIDE 9


y = wx + b

y = w₁x + w₂x² + b


We can fit a parabola through the data perfectly. We can see this as a more powerful model, but we can also see this as a 2D regression problem, where the second feature is derived from the first. This is relevant because linear models are extremely simple to fit. By adding derived features we can have our cake and eat it too: a simple model that we can fit quickly and accurately, and a powerful model that can fit many nonlinear aspects of the data. If we don’t have any intuition for which extra features might be worth adding, we can just add all cross products. Other functions like the sine or the logarithm may also help a lot.
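Concretely, once x² is added as a feature, fitting y = w₁x + w₂x² + b is an ordinary linear least-squares problem (a sketch on toy noiseless data):

```python
import numpy as np

# Toy data generated by a parabola: y = 3x^2 - x + 0.5
x = np.linspace(-2, 2, 50)
y = 3 * x**2 - x + 0.5

# Design matrix with the derived feature x^2 and a bias column
A = np.column_stack([x, x**2, np.ones_like(x)])
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(w1, w2, b)  # recovers -1, 3 and 0.5 (up to floating point)
```

The fit is still linear in the weights; only the features are nonlinear in x.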


One final example. Here we color points red if the distance to the origin is less than 0.7. Again, this dataset is not at all linearly separable. Using Pythagoras, however, we can express how the classes are decided: if x₁² + x₂² < 0.7² then we classify as red, otherwise as blue. This is a linear decision boundary for the features x₁² and x₂².
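A small sketch of this last construction, on synthetic points:

```python
import numpy as np

# Points are red iff their distance to the origin is below 0.7
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
r2 = 0.7 ** 2
red = X[:, 0]**2 + X[:, 1]**2 < r2

# In the derived features (x1^2, x2^2), the same rule is a linear
# decision boundary: x1^2 + x2^2 < 0.49
squared = X ** 2
pred = squared[:, 0] + squared[:, 1] < r2
print((pred == red).all())  # True
```

The circular boundary in the original space becomes a straight line in the squared-feature space.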


SLIDE 10

break


normalisation


Let’s go back to the kNN classifier.


[figure: a 1-NN query point, with year of birth (1900–2000) on one axis and pupil dilation (0.002–0.010 m) on the other; distances of 40 and .2 to two neighbours are marked]

Imagine we are using a 1-NN classifier (i.e. it only looks at the nearest example, and copies its class). In this plot, it looks like the blue and the red dot are the same distance away. Actually, because years are so much bigger than pupil dilations, the blue dot will always be much closer. But this distinction is not meaningful. What we want to look at is how much spread there is in the data, and use that as our natural distance. We do that by normalising our data before feeding it to the model.


SLIDE 11

normalisation

x ← (x − x_min) / (x_max − x_min)

Normalisation scales the data linearly so that the smallest point becomes 0 and the largest becomes 1. Note that because x_min is negative (in this example), we are actually moving all data to the right, and then rescaling it. We do this independently for each feature.
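A minimal sketch, with toy columns mirroring the slide's example (year of birth and pupil dilation):

```python
import numpy as np

# Each row is an instance, each column a feature
X = np.array([[1900.0, 0.002],
              [1950.0, 0.006],
              [2000.0, 0.010]])

# Min-max normalisation, applied to each feature (column) independently
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm)  # every column now runs from 0 to 1
```

After this, a distance of 0.5 means the same thing along both axes: halfway across the observed range of that feature.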

whitening

x ← (x − µ) / σ

(µ: the mean, σ: the standard deviation)

Another option is whitening. We rescale the data so that the mean becomes zero, and the standard deviation becomes 1. In essence, we are transforming our data so that it looks like it was sampled from a standard normal distribution (as much as we can with a linear transformation)
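A minimal sketch of per-feature whitening (toy data, with scales loosely inspired by the year/pupil example):

```python
import numpy as np

# Two features with very different means and spreads
rng = np.random.default_rng(2)
X = rng.normal(loc=[1950.0, 0.006], scale=[30.0, 0.002], size=(1000, 2))

# Whitening each feature (column) independently: zero mean, unit std
X_white = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_white.mean(axis=0), X_white.std(axis=0))  # ~[0, 0] and [1, 1]
```

As the next slide shows, this per-feature recipe is only fully satisfying when the features are uncorrelated.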


Here’s what whitening looks like. If the data is uncorrelated, we are reducing it to a nice spherical distribution, centered on the origin, with the same variance in each direction. If, however, our data is correlated (knowing the value of one feature helps us predict the value of the other), we get a different result. This is because we whiten each feature independently, and the features are not independent. Is there a way to achieve the same effect with the correlated data? Can we transform the features somehow so that it looks like they came from a distribution like the one top right?


SLIDE 12


In essence we want to transform the data top right into something that looks like the data bottom left. Or, the same question asked differently: can we express the data in another coordinate system, so that in the new coordinate system the features are not correlated and the variance in the direction of each axis is 1?

summing vectors

a + b = c

To make this more precise, we’ll need some preliminaries. Here’s a reminder of how summing vectors works.

bases

b1 = (1, 0)    b2 = (0, 1)    x = 3·b1 + 2·b2 = (3, 2)

We can see our modern Cartesian coordinate system as made up entirely of the two vectors (1 0) and (0 1). Every point in the plane is just a linear combination of these two. A coordinate like (3 2) means: “sum three copies of b1 and add them to two copies of b2.”


SLIDE 13

bases

b1 = (1.26, 0.9)    b2 = (−0.3, −0.5)
x = 2.5·b1 + 0.5·b2 = (3, 2)    x′ = (2.5, 0.5)

If we choose different basis vectors, we get a different coordinate system to express our data in.
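Numerically, converting between coordinate systems is just solving a linear system. A sketch assuming the basis vectors b1 = (1.26, 0.9) and b2 = (−0.3, −0.5) as reconstructed from the slide:

```python
import numpy as np

# The columns of B are the basis vectors, expressed in standard coordinates.
# The sign of b2's second component is an assumption: it was garbled on the slide.
B = np.array([[1.26, -0.3],
              [0.90, -0.5]])
x = np.array([3.0, 2.0])       # a point, in standard coordinates

x_new = np.linalg.solve(B, x)  # its coordinates in the basis B
print(x_new)                   # [2.5 0.5]
print(B @ x_new)               # back to standard coordinates: [3. 2.]
```

Multiplying by B maps B-coordinates to standard coordinates, and solving (i.e. applying B⁻¹) maps back.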

rebasing

If the columns of B are unit vectors (as well as orthogonal), then B represents an orthonormal basis. In that case:

38

B = [b1, b2]
x = Bx′    (from B to the standard basis)
x′ = B⁻¹x  (from the standard basis to B)
x′ = Bᵀx   (from the standard basis to B, if B is orthonormal)



Here [a,b] represents the matrix created by concatenating the vectors a and b.
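As a quick numerical check, here is this change of basis in NumPy (a sketch with a made-up orthonormal basis: a 45-degree rotation of the standard basis):

```python
import numpy as np

# An orthonormal basis B: two perpendicular unit vectors.
b1 = np.array([1.0, 1.0]) / np.sqrt(2)
b2 = np.array([-1.0, 1.0]) / np.sqrt(2)
B = np.column_stack([b1, b2])      # B = [b1, b2]

x = np.array([2.0, 0.0])           # a point in standard coordinates
x_prime = B.T @ x                  # to basis B (B.T equals the inverse of B)
x_back = B @ x_prime               # back to standard coordinates

assert np.allclose(B.T @ B, np.eye(2))   # orthonormal: B^T B = I
assert np.allclose(x_back, x)
```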

39


We can now re-phrase what we’re aiming to do: we want to find a set of new basis vectors (preferably orthonormal) so that we can express the data in a coordinate system where the features are not correlated, and the variance is 1 in every direction. If the basis is orthonormal, we cannot ensure the latter (that would require using small basis vectors in directions of low variance), but we’ll deal with that in a different way.

22.Methodology2.key - 20 March 2018

slide-14
SLIDE 14

multivariate normal distribution (MVN)

40

µ = (1, −1)ᵀ    Σ = [[2, 0.7], [0.7, 0.7]]



A multivariate normal distribution is a generalisation of a one-dimensional normal distribution. Its mean is a single point, and its variance is determined by a symmetric matrix called a covariance matrix. The values on the diagonal indicate how much variance there is along each dimension. The off-diagonal elements indicate how much co-variance there is between dimensions.

Fitting an MVN to data

41

m = (1/n) Σᵢ xᵢ
X = [x₁, …, xₙ] − m
S = (1/(n−1)) XXᵀ



The estimators for the sample mean and sample covariance look like this. Computing these values lets you fit an MVN to your data.
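The two estimators above can be computed directly in NumPy (synthetic data for illustration; rows are instances):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))             # 100 instances, 2 features

m = data.mean(axis=0)                        # sample mean
centered = data - m                          # X = [x1, ..., xn] - m
S = centered.T @ centered / (len(data) - 1)  # sample covariance, n-1 denominator

assert np.allclose(S, np.cov(data, rowvar=False))   # matches NumPy's estimator
```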

42

MVNs as transformations


One very helpful way to think of MVNs is as transformations of a single standard MVN. The circle containing most of the data in a standard MVN, is transformed to an ellipse containing most of the data in the non-standard MVN.


slide-15
SLIDE 15

MVNs as transformations

Start with N(0, I)

  • Let X ~ N(0, I)
  • Let Y = AX + t
  • Y ~ N(t, AAᵀ)

43


We can sample from an n-dimensional standard MVN (with mean at the origin and the identity matrix I as a covariance matrix) by simply filling a length-n vector with values sampled from a one-dimensional standard normal distribution. If we then transform x by multiplying it by some matrix A and adding some vector t, the result is the same as sampling from an MVN with mean t and covariance AAᵀ.

why don’t we…?

  • Compute S, m from data.
  • Find some A such that S = AAᵀ (Cholesky decomposition).
  • Use A⁻¹ to whiten the data.

44


We can use this information to achieve what we need. But if we do this, we’d be missing a trick. When rebasing a dataset, how the old axes map to the new axes is arbitrary (which of the new axes will be axis 1 is an arbitrary choice). We will see that we can achieve some very interesting effects if we make a more considered choice. Specifically, in the new coordinates, we’d like the axes to be ordered by variance: the first axis should be the one along which the variance is the highest.
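For reference, the Cholesky-based whitening sketched on the slide looks like this (the mixing matrix below is made-up; any correlated data works):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated synthetic data to whiten.
M = np.array([[2.0, 0.7],
              [0.0, 0.5]])
X = rng.standard_normal(size=(50_000, 2)) @ M + np.array([1.0, -1.0])

m = X.mean(axis=0)
S = np.cov(X, rowvar=False)

A = np.linalg.cholesky(S)                 # find A with S = A A^T
white = (X - m) @ np.linalg.inv(A).T      # apply A^{-1} to every instance

# The whitened data has identity covariance.
assert np.allclose(np.cov(white, rowvar=False), np.eye(2), atol=1e-8)
```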

eigenvectors (of A)

45


For that, we’ll need to add one more item to our toolbelt: eigenvectors. If we interpret a matrix A as an operation, we can see that it transforms all vectors. Some vectors get stretched and rotated, some only get stretched. A vector is an eigenvector of a matrix if it only gets stretched (or flipped), but its direction doesn’t change. source: By TreyGreer62 - Image:Mona Lisa-restored.jpg, CC0, https://commons.wikimedia.org/w/index.php?curid=12768508


slide-16
SLIDE 16

eigenvectors

46

Au = λu



Here is how we express that algebraically. A vector u is an eigenvector of A if multiplying u by A is the same as multiplying u by some scalar λ. The scalar λ is the eigenvalue corresponding to u.
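We can check the defining property Au = λu numerically (the matrix here is made-up):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `vectors` are the eigenvectors; `values` holds the eigenvalues.
values, vectors = np.linalg.eig(A)

for lam, u in zip(values, vectors.T):
    assert np.allclose(A @ u, lam * u)   # Au = lambda * u: only stretched
```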

scaling matrix

47

[[1/3, 0], [0, 4/3]]



For some matrices, the eigenvectors align with the axes. That is, the standard basis vectors are eigenvectors. Such matrices, called scaling matrices, are zero everywhere except on the diagonal.

decomposition

48

Ax = UZUᵀx


For any transformation that can’t be expressed as a scaling matrix, we can simply rebase it first, apply a scaling matrix, and then return it to its original base. In this case, the blue and red vectors are the eigenvectors of the operation we want to apply. We first change the base to U so that the eigenvectors are aligned with the axes. Then we apply a scaling matrix Z, and then we undo the change of basis (we require that U is orthonormal, so we can use Uᵀ) to return to the original coordinates.

Thus, pre-multiplying by A is the same as pre-multiplying by Uᵀ, Z, and U in order. In other words: we can decompose the matrix A into UZUᵀ.


slide-17
SLIDE 17

singular value decomposition

  • Z is diagonal, U represents an orthonormal basis.
  • The columns of U are the eigenvectors of A.
  • The diagonal values of Z are the corresponding eigenvalues.
  • By convention, the diagonal of Z is sorted from largest to smallest.

49

A = UZUᵀ


This is called the singular value decomposition of A. We can search for U and Z through stochastic gradient descent, but there are also more specific and efficient algorithms available. The last bullet point is important. This will ensure that in our new coordinates, the axis with the greatest variance will be the first. The SVD of A is the same as the SVD of the sample covariance, but with the square root of the eigenvalues.
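One of those more efficient algorithms for symmetric matrices is `numpy.linalg.eigh`; a sketch with a made-up covariance-like matrix, flipping the output so the largest eigenvalue comes first, per the convention above:

```python
import numpy as np

A = np.array([[2.0, 0.7],
              [0.7, 0.7]])               # symmetric, like a covariance matrix

# eigh returns eigenvalues in ascending order; flip so the largest comes first.
values, U = np.linalg.eigh(A)
values, U = values[::-1], U[:, ::-1]
Z = np.diag(values)

assert np.allclose(U @ Z @ U.T, A)       # A = U Z U^T
assert np.allclose(U.T @ U, np.eye(2))   # U is orthonormal
assert values[0] >= values[1]            # sorted from largest to smallest
```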

50

S = AAᵀ    A = UZUᵀ
S = UZUᵀ(UZUᵀ)ᵀ = UZUᵀUZUᵀ = UZZUᵀ



If A is a transformation that maps our data distribution to the standard normal distribution, we can show this. Thus, we don’t need to find A, we can just fit the covariance matrix S and take the singular value decomposition of that. The transformation U will rebase our data so that the features aren’t correlated, and the scaling matrix Z will scale along the axes (in the new basis) to ensure that the variance is 1 in all directions.

principal component analysis

  • Mean-center the data.
  • Compute the sample covariance S.
  • Take the SVD: S = UZUᵀ.
  • Whiten the data: x ← Uᵀx.

51


This leads to PCA: a data normalisation method that takes feature correlations into account.
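The recipe above can be sketched in a few lines of NumPy (the data here is synthetic; note that the slide’s last step only rotates with Uᵀ, while this sketch also divides by the square roots of the eigenvalues to get unit variance, as discussed in the notes):

```python
import numpy as np

def pca_whiten(X):
    # Rows of X are instances.
    X = X - X.mean(axis=0)                  # mean-center
    S = np.cov(X, rowvar=False)             # sample covariance
    values, U = np.linalg.eigh(S)           # S = U Z U^T (eigh: ascending order)
    values, U = values[::-1], U[:, ::-1]    # largest variance first
    rotated = X @ U                         # x <- U^T x, for every row
    return rotated / np.sqrt(values)        # scale each new axis to unit variance

rng = np.random.default_rng(3)
X = rng.standard_normal((10_000, 2)) @ np.array([[2.0, 0.7],
                                                 [0.0, 0.5]])
W = pca_whiten(X)
assert np.allclose(np.cov(W, rowvar=False), np.eye(2), atol=1e-8)
```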


slide-18
SLIDE 18

52


Principal component analysis is good for whitening the data, but also for reducing the number of dimensions. If we represent the points of our data only by their first principal component, we project them onto this line. In some sense, this gives us the best linear projection of our data onto one dimension.
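Keeping only the first k components is the same recipe minus the scaling step, truncated after k columns (synthetic 3D data for illustration):

```python
import numpy as np

def pca_project(X, k):
    # Keep only the first k principal components (rows of X are instances).
    X = X - X.mean(axis=0)
    values, U = np.linalg.eigh(np.cov(X, rowvar=False))
    U = U[:, ::-1]                  # highest-variance directions first
    return X @ U[:, :k]

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                              [0.5, 1.0, 0.0],
                                              [0.2, 0.1, 0.3]])
Z = pca_project(X, 1)               # best linear 1D projection of the data
assert Z.shape == (500, 1)
```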

summary

PCA:

  • Expresses the data in new coordinates, aligned with the covariance.
  • The first coordinate (first principal component) is the line along which the data has the most variance.
  • The second coordinate is the line along which the remaining variance is the highest (and so on).
  • Representing data by only its first k principal components is a great dimensionality reduction method.

53

54

Here is an example of how PCA is used in research. An anatomist specialising in primates can easily tell for a single bone in isolation that it’s an early hominid fossil (very rare) and not a chimpanzee fossil (not rare). But how to substantiate this? “It’s true because I can see that it is” is not very scientific. Here’s one common approach. Take a large collection of the same specific bone (the scapula, or shoulder blade, in this case) from different apes and humans, and take a bunch of measurements (features) of each. Do a PCA, and plot the first two principal components. As you can see, the different species form very clear clusters, even in just two dimensions. We can now show that a new fossil we’ve found is clearly closer to human than to chimp just by measuring it, and projecting it into this space. source: Fossil hominin shoulders support an African ape-like last common ancestor of humans and chimpanzees. Nathan M. Young, Terence D. Capellini, Neil T. Roach and Zeresenay Alemseged http://www.pnas.org/content/112/38/11829


slide-19
SLIDE 19

55


Here is another example. In this research, the authors took a database of 3000 Europeans and extracted features from their DNA. They used about half a million sites on the DNA sequence where DNA varies among humans (i.e. 3000 instances: people, and 500k features: DNA markers). The two principal components of this data largely express how far north the person lives, and how far east the person lives, which means that the large-scale geography of Europe can be extracted from our DNA. If I sent a large sample of European DNA to some aliens on the other side of the galaxy who’d never seen our planet, they could use it to get a rough idea of our geography.

source: Genes mirror geography within Europe John Novembre et al. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/

eigenfaces

from sklearn import datasets
faces = datasets.fetch_olivetti_faces()

56


Finally, possibly the most magical illustration of PCA: eigenfaces. Here we have a dataset (which you can easily get from sklearn) containing 400 images, in 64x64 grayscale, of a number of people. The lighting is nicely uniform and the facial features are always in approximately the same place. We take each pixel as a feature, giving us 400 instances each represented by a 4096-dimensional feature vector.
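Computing the eigenfaces then amounts to a PCA of the 400 × 4096 data matrix. A sketch, using random pixels as a stand-in for the Olivetti faces (the `fetch_olivetti_faces()` call above downloads the real images), and the SVD of the centered data rather than forming the 4096 × 4096 covariance matrix:

```python
import numpy as np

# Stand-in for the Olivetti faces: 400 images of 64x64 = 4096 pixels each.
rng = np.random.default_rng(5)
X = rng.random((400, 4096))

mean_face = X.mean(axis=0)                 # the "mean face", as a 4096-vector
# The rows of Vt are the eigenvectors of the sample covariance,
# sorted from largest to smallest singular value.
_, _, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
eigenfaces = Vt[:30].reshape(30, 64, 64)   # first 30 eigenvectors, as images

assert eigenfaces.shape == (30, 64, 64)
assert np.allclose(Vt @ Vt.T, np.eye(len(Vt)))   # eigenvectors are orthonormal
```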

mean face

57


Here is the sample mean of our data, re-arranged back into an image.


slide-20
SLIDE 20

58

eigenvectors


These are the first 30 eigenvectors (top left is the first, to the right of that is the second and so on). We’ve re-arranged them into images, but these are just vectors in our 4096-dimensional space. They are the basis vectors that are most natural for our data.

59

bases


60


Starting from the mean face, we can take little steps along the direction of one of our eigenvectors. We see that moving along the first eigenvector roughly corresponds to ageing the face. Moving along the fourth seems to make the face more female.


slide-21
SLIDE 21

61


Here is the same, but starting at one of the images of the dataset. The middle column represents the starting point. To the right we add the k-th principal component, to the left we subtract it. Note the effect of the fifth eigenvector: subtracting it opens the mouth, and adding it seems to push the lips closer together.

62


If e is the vector describing the image in the new coordinates found by PCA, we can reconstruct the image by starting with the mean (top left), adding e1 times the first eigenvector, adding e2 times the second to that, and so on. Since the eigenvectors are arranged by variance, the first has the highest impact. As we add more and more eigenvectors, we get gradually closer to the original image.

63


At bottom right is the reconstruction after 60 eigenvectors.


slide-22
SLIDE 22

mlcourse@peterbloem.nl

64
