
Linear Models
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification


  1. Linear model

$$y_t \approx \beta_0 + \beta_1 \cos\left(\frac{2\pi t}{12}\right) + \beta_2 \sin\left(\frac{2\pi t}{12}\right) + \beta_3 t$$

where $1 \le t \le n$ is the time in months ($n = 12 \cdot 150$)
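
A minimal numpy sketch of this fit, assuming a hypothetical array `temps` of monthly temperature readings (the real data are not included here): build the four-column design matrix and solve the least-squares problem.

```python
import numpy as np

n = 12 * 150                      # 150 years of monthly observations
t = np.arange(1, n + 1)           # time in months, 1 <= t <= n
temps = np.random.randn(n)        # hypothetical stand-in for the temperature data

# Design matrix for beta_0 + beta_1 cos(2 pi t / 12) + beta_2 sin(2 pi t / 12) + beta_3 t
X = np.column_stack([
    np.ones(n),
    np.cos(2 * np.pi * t / 12),
    np.sin(2 * np.pi * t / 12),
    t,
])

beta_ls, *_ = np.linalg.lstsq(X, temps, rcond=None)
trend_per_century = beta_ls[3] * 12 * 100   # slope per month -> degrees per 100 years
```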

  2. Model fitted by least squares  [Figure: monthly temperature data (Celsius) and fitted model, 1860–2000]

  3. Model fitted by least squares  [Figure: zoomed view of the data (Celsius) and fitted model, 1900–1905]

  4. Model fitted by least squares  [Figure: zoomed view of the data (Celsius) and fitted model, 1960–1965]

  5. Trend: Increase of 0.75 °C / 100 years (1.35 °F)  [Figure: temperature data (Celsius) and fitted linear trend, 1860–2000]

  6. Model for minimum temperatures  [Figure: minimum temperature data (Celsius) and fitted model, 1860–2000]

  7. Model for minimum temperatures  [Figure: zoomed view of the data (Celsius) and fitted model, 1900–1905]

  8. Model for minimum temperatures  [Figure: zoomed view of the data (Celsius) and fitted model, 1960–1965]

  9. Trend: Increase of 0.88 °C / 100 years (1.58 °F)  [Figure: minimum temperature data (Celsius) and fitted linear trend, 1860–2000]

  10. Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

  11. Geometric interpretation

◮ Any vector $X\vec\beta$ is in the span of the columns of $X$
◮ The least-squares estimate is the closest vector to $\vec y$ that can be represented in this way
◮ This is the projection of $\vec y$ onto the column space of $X$:

$$X\vec\beta_{\mathrm{LS}} = USV^T VS^{-1}U^T\vec y = UU^T\vec y$$
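
A quick numerical check of this identity on random data (a sketch, not part of the slides): the fitted values $X\vec\beta_{\mathrm{LS}}$ coincide with the projection $UU^T\vec y$ computed from the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T with U of size n x p

# X beta_LS equals the orthogonal projection of y onto the column space of X
print(np.allclose(X @ beta_ls, U @ (U.T @ y)))     # True
```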

  12. Geometric interpretation

  13. Face denoising

We denoise by projecting onto:
◮ $S_1$: the span of the 9 images from the same subject
◮ $S_2$: the span of the 360 images in the training set

Test error:
$$\frac{\|\vec x - P_{S_1}\vec y\|_2}{\|\vec x\|_2} = 0.114 \qquad \frac{\|\vec x - P_{S_2}\vec y\|_2}{\|\vec x\|_2} = 0.078$$

  14. $S_1 := \operatorname{span}(\,\cdots)$  [Figure: the 9 images from the same subject]

  15. Denoising via projection onto $S_1$

[Figure: face images decomposed into projections onto $S_1$ and $S_1^\perp$]

                     Projection onto $S_1$   Projection onto $S_1^\perp$
  Signal $\vec x$           0.993                   0.114
  Noise $\vec z$            0.007                   0.150
  Data $\vec y$   =       Estimate          +      remainder

  16. $S_2 := \operatorname{span}(\,\cdots)$  [Figure: the 360 images in the training set]

  17. Denoising via projection onto $S_2$

[Figure: face images decomposed into projections onto $S_2$ and $S_2^\perp$]

                     Projection onto $S_2$   Projection onto $S_2^\perp$
  Signal $\vec x$           0.998                   0.063
  Noise $\vec z$            0.043                   0.144
  Data $\vec y$   =       Estimate          +      remainder

  18. $P_{S_1}\vec y$ and $P_{S_2}\vec y$  [Figure: the original image $\vec x$, $P_{S_1}\vec y$, and $P_{S_2}\vec y$ side by side]
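
The projections $P_{S_1}\vec y$ and $P_{S_2}\vec y$ above can be computed from an orthonormal basis of each span. A sketch with random stand-ins for the face images (the dimensions and noise level are assumptions, not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1024                                        # pixels per vectorized image (assumed)
S1_images = rng.standard_normal((d, 9))         # stand-in: 9 images of the same subject
S2_images = rng.standard_normal((d, 360))       # stand-in: full training set
x = rng.standard_normal(d)                      # clean test image (unknown in practice)
y = x + 0.15 * rng.standard_normal(d)           # noisy observation

def project(A, v):
    """Orthogonal projection of v onto the column space of A."""
    Q, _ = np.linalg.qr(A)
    return Q @ (Q.T @ v)

for name, A in [("S1", S1_images), ("S2", S2_images)]:
    estimate = project(A, y)
    print(name, np.linalg.norm(x - estimate) / np.linalg.norm(x))   # relative test error
```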

  19. Lessons of Face Denoising

What does the intuition we gained from face denoising tell us about linear regression?

  20. Lessons of Face Denoising

What does the intuition we gained from face denoising tell us about linear regression?
◮ More features = larger column space
◮ Larger column space = captures more of the true image
◮ Larger column space = captures more of the noise
◮ Balance between underfitting and overfitting

  21. Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

  22. Motivation

Model the data $y_1, \ldots, y_n$ as realizations of a set of random variables $\mathrm{y}_1, \ldots, \mathrm{y}_n$

The joint pdf depends on a vector of parameters $\vec\beta$:
$$f_{\vec\beta}(y_1, \ldots, y_n) := f_{\mathrm{y}_1, \ldots, \mathrm{y}_n}(y_1, \ldots, y_n)$$
is the probability density of $\mathrm{y}_1, \ldots, \mathrm{y}_n$ at the observed data

Idea: Choose $\vec\beta$ such that the density is as high as possible

  23. Likelihood

The likelihood is equal to the joint pdf
$$L_{y_1, \ldots, y_n}(\vec\beta) := f_{\vec\beta}(y_1, \ldots, y_n)$$
interpreted as a function of the parameters $\vec\beta$

The log-likelihood function is the log of the likelihood, $\log L_{y_1, \ldots, y_n}(\vec\beta)$

  24. Maximum-likelihood estimator

The likelihood quantifies how likely the data are according to the model

Maximum-likelihood (ML) estimator:
$$\vec\beta_{\mathrm{ML}}(y_1, \ldots, y_n) := \arg\max_{\vec\beta} L_{y_1, \ldots, y_n}(\vec\beta) = \arg\max_{\vec\beta} \log L_{y_1, \ldots, y_n}(\vec\beta)$$

Maximizing the log-likelihood is equivalent, and often more convenient

  25. Probabilistic interpretation

We model the noise as an iid Gaussian random vector $\vec z$ whose entries have zero mean and variance $\sigma^2$

The data are a realization of the random vector
$$\vec{\mathrm{y}} := X\vec\beta + \vec z$$

$\vec{\mathrm{y}}$ is Gaussian with mean $X\vec\beta$ and covariance matrix $\sigma^2 I$

  26. Likelihood

The joint pdf of $\vec{\mathrm{y}}$ is
$$f_{\vec{\mathrm{y}}}(\vec a) := \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(\vec a[i] - (X\vec\beta)[i]\right)^2\right) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left(-\frac{1}{2\sigma^2}\left\|\vec a - X\vec\beta\right\|_2^2\right)$$

The likelihood is
$$L_{\vec y}\left(\vec\beta\right) = \frac{1}{\sqrt{(2\pi)^n}} \exp\left(-\frac{1}{2}\left\|\vec y - X\vec\beta\right\|_2^2\right)$$

  27. Maximum-likelihood estimate

The maximum-likelihood estimate is
$$\vec\beta_{\mathrm{ML}} = \arg\max_{\vec\beta} L_{\vec y}\left(\vec\beta\right) = \arg\max_{\vec\beta} \log L_{\vec y}\left(\vec\beta\right) = \arg\min_{\vec\beta} \left\|\vec y - X\vec\beta\right\|_2^2 = \vec\beta_{\mathrm{LS}}$$
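
A small numerical sanity check of this equivalence (a sketch on simulated data, assuming a known noise level $\sigma$): minimizing the negative Gaussian log-likelihood with scipy recovers the same coefficients as np.linalg.lstsq.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p, sigma = 50, 3, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + sigma * rng.standard_normal(n)

def neg_log_likelihood(beta):
    r = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + r @ r / (2 * sigma**2)

beta_ml = minimize(neg_log_likelihood, np.zeros(p)).x
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ml, beta_ls, atol=1e-4))   # True: ML and LS coincide
```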

  28. Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

  29. Estimation error

If the data are generated according to the linear model
$$\vec y := X\vec\beta^* + \vec z$$
then
$$\vec\beta_{\mathrm{LS}} - \vec\beta^*$$

  30. Estimation error

If the data are generated according to the linear model $\vec y := X\vec\beta^* + \vec z$, then
$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = \left(X^TX\right)^{-1}X^T\left(X\vec\beta^* + \vec z\right) - \vec\beta^*$$

  31. Estimation error

If the data are generated according to the linear model $\vec y := X\vec\beta^* + \vec z$, then
$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = \left(X^TX\right)^{-1}X^T\left(X\vec\beta^* + \vec z\right) - \vec\beta^* = \left(X^TX\right)^{-1}X^T\vec z$$
as long as $X$ is full rank

  32. LS estimator is unbiased

Assume the noise $\vec z$ is random with zero mean, then
$$E\left(\vec\beta_{\mathrm{LS}} - \vec\beta^*\right)$$

  33. LS estimator is unbiased

Assume the noise $\vec z$ is random with zero mean, then
$$E\left(\vec\beta_{\mathrm{LS}} - \vec\beta^*\right) = \left(X^TX\right)^{-1}X^T E\left(\vec z\right)$$

  34. LS estimator is unbiased

Assume the noise $\vec z$ is random with zero mean, then
$$E\left(\vec\beta_{\mathrm{LS}} - \vec\beta^*\right) = \left(X^TX\right)^{-1}X^T E\left(\vec z\right) = 0$$

The estimate is unbiased: its mean equals $\vec\beta^*$
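
A Monte Carlo sketch of the unbiasedness claim on random data (not from the slides): the average of the LS estimate over many noise draws approaches $\vec\beta^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p)
pinv = np.linalg.pinv(X)                       # (X^T X)^{-1} X^T for full-rank X

trials = 20000
estimates = np.empty((trials, p))
for i in range(trials):
    z = rng.standard_normal(n)                 # zero-mean noise
    estimates[i] = pinv @ (X @ beta_star + z)  # least-squares estimate for this draw

print(np.abs(estimates.mean(axis=0) - beta_star).max())   # close to 0
```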

  35. Least-squares error

If the data are generated according to the linear model $\vec y := X\vec\beta^* + \vec z$, then
$$\frac{\|\vec z\|_2}{\sigma_1} \le \left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2 \le \frac{\|\vec z\|_2}{\sigma_p}$$
where $\sigma_1$ and $\sigma_p$ are the largest and smallest singular values of $X$

  36. Least-squares error: Proof

The error is given by $\vec\beta_{\mathrm{LS}} - \vec\beta^* = (X^TX)^{-1}X^T\vec z$. How can we bound $\left\|(X^TX)^{-1}X^T\vec z\right\|_2$?

  37. Singular values

The singular values of a matrix $A \in \mathbb{R}^{n \times p}$ of rank $p$ satisfy
$$\sigma_1 = \max_{\{\vec x \in \mathbb{R}^p \,:\, \|\vec x\|_2 = 1\}} \|A\vec x\|_2 \qquad \sigma_p = \min_{\{\vec x \in \mathbb{R}^p \,:\, \|\vec x\|_2 = 1\}} \|A\vec x\|_2$$

  38. Least-squares error

$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = VS^{-1}U^T\vec z$$

The smallest and largest singular values of $VS^{-1}U^T$ are $1/\sigma_1$ and $1/\sigma_p$, so
$$\frac{\|\vec z\|_2}{\sigma_1} \le \left\|VS^{-1}U^T\vec z\right\|_2 \le \frac{\|\vec z\|_2}{\sigma_p}$$

  39. Experiment

$X_{\mathrm{train}}$, $X_{\mathrm{test}}$, $\vec z_{\mathrm{train}}$ and $\vec\beta^*$ are sampled iid from a standard Gaussian; the data have 50 features

$$\vec y_{\mathrm{train}} = X_{\mathrm{train}}\vec\beta^* + \vec z_{\mathrm{train}} \qquad \vec y_{\mathrm{test}} = X_{\mathrm{test}}\vec\beta^* \quad \text{(no test noise)}$$

We use $\vec y_{\mathrm{train}}$ and $X_{\mathrm{train}}$ to compute $\vec\beta_{\mathrm{LS}}$

$$\mathrm{error}_{\mathrm{train}} = \frac{\left\|X_{\mathrm{train}}\vec\beta_{\mathrm{LS}} - \vec y_{\mathrm{train}}\right\|_2}{\|\vec y_{\mathrm{train}}\|_2} \qquad \mathrm{error}_{\mathrm{test}} = \frac{\left\|X_{\mathrm{test}}\vec\beta_{\mathrm{LS}} - \vec y_{\mathrm{test}}\right\|_2}{\|\vec y_{\mathrm{test}}\|_2}$$
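
A numpy sketch reproducing this experiment (the grid of n values and the random seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 50
beta_star = rng.standard_normal(p)

for n in [50, 100, 200, 300, 400, 500]:
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n, p))
    z_train = rng.standard_normal(n)
    y_train = X_train @ beta_star + z_train
    y_test = X_test @ beta_star                       # no test noise

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    # training error climbs toward the noise level ~0.14, test error shrinks toward 0
    print(n, round(err_train, 3), round(err_test, 3))
```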

  40. Experiment  [Figure: relative error (ℓ2 norm) versus $n$ from 50 to 500, showing Error (training), Error (test), and the training noise level]

  41. Experiment

Questions
1. Can we approximate the relative noise level $\|\vec z\|_2 / \|\vec y\|_2$?
2. Why does the training error start at 0?
3. Why does the relative training error converge to the noise level?
4. Why does the relative test error converge to zero?

  42. Experiment

Questions
1. Can we approximate the relative noise level $\|\vec z\|_2 / \|\vec y\|_2$?
   $\|\vec z_{\mathrm{train}}\|_2 \approx \sqrt{n}$, $\|\vec\beta^*\|_2 \approx \sqrt{50}$, $\|X_{\mathrm{train}}\vec\beta^*\|_2 \approx \sqrt{50\,n}$, so $\|\vec z\|_2/\|\vec y\|_2 \approx \frac{1}{\sqrt{51}} \approx 0.140$
2. Why does the training error start at 0?
3. Why does the relative training error converge to the noise level?
4. Why does the relative test error converge to zero?

  43. Experiment

Questions
1. Can we approximate the relative noise level $\|\vec z\|_2 / \|\vec y\|_2$?
   $\|\vec z_{\mathrm{train}}\|_2 \approx \sqrt{n}$, $\|\vec\beta^*\|_2 \approx \sqrt{50}$, $\|X_{\mathrm{train}}\vec\beta^*\|_2 \approx \sqrt{50\,n}$, so $\|\vec z\|_2/\|\vec y\|_2 \approx \frac{1}{\sqrt{51}} \approx 0.140$
2. Why does the training error start at 0?
   $X$ is square and invertible
3. Why does the relative training error converge to the noise level?
4. Why does the relative test error converge to zero?

  44. Experiment

Questions
1. Can we approximate the relative noise level $\|\vec z\|_2 / \|\vec y\|_2$?
   $\|\vec z_{\mathrm{train}}\|_2 \approx \sqrt{n}$, $\|\vec\beta^*\|_2 \approx \sqrt{50}$, $\|X_{\mathrm{train}}\vec\beta^*\|_2 \approx \sqrt{50\,n}$, so $\|\vec z\|_2/\|\vec y\|_2 \approx \frac{1}{\sqrt{51}} \approx 0.140$
2. Why does the training error start at 0?
   $X$ is square and invertible
3. Why does the relative training error converge to the noise level?
   $\left\|X_{\mathrm{train}}\vec\beta_{\mathrm{LS}} - \vec y_{\mathrm{train}}\right\|_2 = \left\|X_{\mathrm{train}}\left(\vec\beta_{\mathrm{LS}} - \vec\beta^*\right) - \vec z_{\mathrm{train}}\right\|_2$ and $\vec\beta_{\mathrm{LS}} \to \vec\beta^*$
4. Why does the relative test error converge to zero?

  45. Experiment

Questions
1. Can we approximate the relative noise level $\|\vec z\|_2 / \|\vec y\|_2$?
   $\|\vec z_{\mathrm{train}}\|_2 \approx \sqrt{n}$, $\|\vec\beta^*\|_2 \approx \sqrt{50}$, $\|X_{\mathrm{train}}\vec\beta^*\|_2 \approx \sqrt{50\,n}$, so $\|\vec z\|_2/\|\vec y\|_2 \approx \frac{1}{\sqrt{51}} \approx 0.140$
2. Why does the training error start at 0?
   $X$ is square and invertible
3. Why does the relative training error converge to the noise level?
   $\left\|X_{\mathrm{train}}\vec\beta_{\mathrm{LS}} - \vec y_{\mathrm{train}}\right\|_2 = \left\|X_{\mathrm{train}}\left(\vec\beta_{\mathrm{LS}} - \vec\beta^*\right) - \vec z_{\mathrm{train}}\right\|_2$ and $\vec\beta_{\mathrm{LS}} \to \vec\beta^*$
4. Why does the relative test error converge to zero?
   We assumed no test noise, and $\vec\beta_{\mathrm{LS}} \to \vec\beta^*$

  46. Non-asymptotic bound

Let $\vec y := X\vec\beta^* + \vec z$, where the entries of $X$ and $\vec z$ are iid standard Gaussians

The least-squares estimate satisfies
$$\frac{\sqrt{p(1-\epsilon)}}{(1+\epsilon)\sqrt{n}} \le \left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2 \le \frac{\sqrt{p(1+\epsilon)}}{(1-\epsilon)\sqrt{n}}$$
with probability at least $1 - 1/p - 2\exp\left(-p\epsilon^2/8\right)$, as long as $n \ge 64\,p\log(12/\epsilon)/\epsilon^2$

  47. Proof

$$\frac{\left\|U^T\vec z\right\|_2}{\sigma_1} \le \left\|VS^{-1}U^T\vec z\right\|_2 \le \frac{\left\|U^T\vec z\right\|_2}{\sigma_p}$$

  48. Projection onto a fixed subspace

Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and $\vec z \in \mathbb{R}^n$ a vector of iid standard Gaussian noise

For any $\epsilon > 0$
$$\mathrm{P}\left( k(1-\epsilon) < \|P_S \vec z\|_2^2 < k(1+\epsilon) \right) \ge 1 - 2\exp\left( -\frac{k\epsilon^2}{8} \right)$$

  49. Projection onto a fixed subspace

Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and $\vec z \in \mathbb{R}^n$ a vector of iid standard Gaussian noise

For any $\epsilon > 0$
$$\mathrm{P}\left( k(1-\epsilon) < \|P_S \vec z\|_2^2 < k(1+\epsilon) \right) \ge 1 - 2\exp\left( -\frac{k\epsilon^2}{8} \right)$$

Consequence: with probability $1 - 2\exp\left(-p\epsilon^2/8\right)$
$$(1-\epsilon)\,p \le \left\| U^T \vec z \right\|_2^2 \le (1+\epsilon)\,p$$
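
A quick Monte Carlo sketch of this concentration result, taking $S$ to be the span of $k$ random directions (an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, trials = 500, 20, 2000

A = rng.standard_normal((n, k))
Q, _ = np.linalg.qr(A)                       # orthonormal basis of a fixed k-dim subspace S

norms_sq = np.empty(trials)
for i in range(trials):
    z = rng.standard_normal(n)
    norms_sq[i] = np.sum((Q.T @ z) ** 2)     # ||P_S z||_2^2 = ||Q^T z||_2^2

print(norms_sq.mean(), k)                    # the mean concentrates around k
print(np.quantile(norms_sq, [0.05, 0.95]))   # most of the mass stays near k
```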

  50. Singular values of a Gaussian matrix

Let $A$ be an $n \times k$ matrix with iid standard Gaussian entries, with $n > k$

For any fixed $\epsilon > 0$, the singular values of $A$ satisfy
$$\sqrt{n}\,(1-\epsilon) \le \sigma_k \le \sigma_1 \le \sqrt{n}\,(1+\epsilon)$$
with probability at least $1 - 1/k$, as long as $n > \frac{64\,k}{\epsilon^2}\log\frac{12}{\epsilon}$

  51. Proof

With probability $1 - 1/p$
$$\sqrt{n}\,(1-\epsilon) \le \sigma_p \le \sigma_1 \le \sqrt{n}\,(1+\epsilon)$$
as long as $n \ge 64\,p\log(12/\epsilon)/\epsilon^2$

  52. Experiment: $\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2 \approx \sqrt{p/n}$  [Figure: relative coefficient error $\|\vec\beta^* - \vec\beta_{\mathrm{LS}}\|_2 / \|\vec\beta^*\|_2$ (ℓ2 norm) versus $n$ from 50 to 20000, for $p = 50, 100, 200$]
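
A sketch reproducing the trend behind this plot on a coarser grid of $n$ (values chosen arbitrarily): the error $\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\|_2$ tracks $\sqrt{p/n}$.

```python
import numpy as np

rng = np.random.default_rng(6)
for p in [50, 100, 200]:
    beta_star = rng.standard_normal(p)
    for n in [1000, 5000, 20000]:
        X = rng.standard_normal((n, p))
        z = rng.standard_normal(n)
        beta_ls, *_ = np.linalg.lstsq(X, X @ beta_star + z, rcond=None)
        err = np.linalg.norm(beta_ls - beta_star)
        # the coefficient error is close to sqrt(p / n), as the bound predicts
        print(p, n, round(err, 4), round(np.sqrt(p / n), 4))
```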

  53. Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

  54. Condition number

The condition number of $A \in \mathbb{R}^{n \times p}$, $n \ge p$, is the ratio $\sigma_1/\sigma_p$ of its largest and smallest singular values

A matrix is ill conditioned if its condition number is large (the matrix is almost rank deficient)

  55. Noise amplification

Let $\vec y := X\vec\beta^* + \vec z$, where $\vec z$ is iid standard Gaussian

With probability at least $1 - 2\exp\left(-\epsilon^2/8\right)$
$$\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2 \ge \frac{\sqrt{1-\epsilon}}{\sigma_p}$$
where $\sigma_p$ is the smallest singular value of $X$

  56. Proof

$$\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2^2$$

  57. Proof

$$\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2^2 = \left\|VS^{-1}U^T\vec z\right\|_2^2$$

  58. Proof

$$\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2^2 = \left\|VS^{-1}U^T\vec z\right\|_2^2 = \left\|S^{-1}U^T\vec z\right\|_2^2 \quad (V \text{ is orthogonal})$$

  59. Proof

$$\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2^2 = \left\|VS^{-1}U^T\vec z\right\|_2^2 = \left\|S^{-1}U^T\vec z\right\|_2^2 \quad (V \text{ is orthogonal}) \quad = \sum_{i=1}^{p} \frac{\left(\vec u_i^T\vec z\right)^2}{\sigma_i^2}$$

  60. Proof

$$\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2^2 = \left\|VS^{-1}U^T\vec z\right\|_2^2 = \left\|S^{-1}U^T\vec z\right\|_2^2 \quad (V \text{ is orthogonal}) \quad = \sum_{i=1}^{p} \frac{\left(\vec u_i^T\vec z\right)^2}{\sigma_i^2} \ge \frac{\left(\vec u_p^T\vec z\right)^2}{\sigma_p^2}$$

  61. Projection onto a fixed subspace

Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and $\vec z \in \mathbb{R}^n$ a vector of iid standard Gaussian noise

For any $\epsilon > 0$
$$\mathrm{P}\left( k(1-\epsilon) < \|P_S \vec z\|_2^2 < k(1+\epsilon) \right) \ge 1 - 2\exp\left( -\frac{k\epsilon^2}{8} \right)$$

  62. Projection onto a fixed subspace

Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and $\vec z \in \mathbb{R}^n$ a vector of iid standard Gaussian noise

For any $\epsilon > 0$
$$\mathrm{P}\left( k(1-\epsilon) < \|P_S \vec z\|_2^2 < k(1+\epsilon) \right) \ge 1 - 2\exp\left( -\frac{k\epsilon^2}{8} \right)$$

Consequence: with probability $1 - 2\exp\left(-\epsilon^2/8\right)$
$$\left(\vec u_p^T\vec z\right)^2 \ge (1-\epsilon)$$

  63. Example

Let $\vec y := X\vec\beta^* + \vec z$ where
$$X := \begin{bmatrix} 0.212 & -0.099 \\ 0.605 & -0.298 \\ -0.213 & 0.113 \\ 0.589 & -0.285 \\ 0.016 & 0.006 \\ 0.059 & 0.032 \end{bmatrix}, \qquad \vec\beta^* := \begin{bmatrix} 0.471 \\ -1.191 \end{bmatrix}, \qquad \vec z := \begin{bmatrix} 0.066 \\ -0.077 \\ -0.010 \\ -0.033 \\ 0.010 \\ 0.028 \end{bmatrix}, \qquad \|\vec z\|_2 = 0.11$$

  64. Example

Condition number = 100
$$X = USV^T = \begin{bmatrix} -0.234 & 0.427 \\ -0.674 & -0.202 \\ 0.241 & 0.744 \\ -0.654 & 0.350 \\ 0.017 & -0.189 \\ 0.067 & 0.257 \end{bmatrix} \begin{bmatrix} 1.00 & 0 \\ 0 & 0.01 \end{bmatrix} \begin{bmatrix} -0.898 & 0.440 \\ 0.440 & 0.898 \end{bmatrix}$$

  65. Example

$$\vec\beta_{\mathrm{LS}} - \vec\beta^*$$

  66. Example

$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = VS^{-1}U^T\vec z$$

  67. Example

$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = VS^{-1}U^T\vec z = V \begin{bmatrix} 1.00 & 0 \\ 0 & 100.00 \end{bmatrix} U^T\vec z$$

  68. Example

$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = VS^{-1}U^T\vec z = V \begin{bmatrix} 1.00 & 0 \\ 0 & 100.00 \end{bmatrix} U^T\vec z = V \begin{bmatrix} 0.058 \\ 3.004 \end{bmatrix}$$

  69. Example

$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = VS^{-1}U^T\vec z = V \begin{bmatrix} 1.00 & 0 \\ 0 & 100.00 \end{bmatrix} U^T\vec z = V \begin{bmatrix} 0.058 \\ 3.004 \end{bmatrix} = \begin{bmatrix} 1.270 \\ 2.723 \end{bmatrix}$$

  70. Example

$$\vec\beta_{\mathrm{LS}} - \vec\beta^* = VS^{-1}U^T\vec z = V \begin{bmatrix} 1.00 & 0 \\ 0 & 100.00 \end{bmatrix} U^T\vec z = V \begin{bmatrix} 0.058 \\ 3.004 \end{bmatrix} = \begin{bmatrix} 1.270 \\ 2.723 \end{bmatrix}$$

so that $\left\|\vec\beta_{\mathrm{LS}} - \vec\beta^*\right\|_2 = 27.00\,\|\vec z\|_2$
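
The slide's computation can be checked numerically. A sketch using the $U$, $S$, $V^T$ and $\vec z$ listed above (entries are rounded to three decimals, so the output only matches approximately):

```python
import numpy as np

U = np.array([[-0.234,  0.427],
              [-0.674, -0.202],
              [ 0.241,  0.744],
              [-0.654,  0.350],
              [ 0.017, -0.189],
              [ 0.067,  0.257]])
S = np.diag([1.00, 0.01])
Vt = np.array([[-0.898, 0.440],
               [ 0.440, 0.898]])
z = np.array([0.066, -0.077, -0.010, -0.033, 0.010, 0.028])

X = U @ S @ Vt
print(np.linalg.cond(X))                              # approximately 100

error = Vt.T @ np.diag([1.00, 100.00]) @ (U.T @ z)    # V S^{-1} U^T z
print(error)                                          # approximately [1.270, 2.723]
print(np.linalg.norm(error) / np.linalg.norm(z))      # approximately 27
```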

  71. Multicollinearity

The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs when features are highly correlated

For any $X \in \mathbb{R}^{n \times p}$ with normalized columns, if $X_i$ and $X_j$, $i \ne j$, satisfy
$$\langle X_i, X_j \rangle \ge 1 - \epsilon^2$$
then the smallest singular value satisfies $\sigma_p \le \epsilon$

  72. Multicollinearity

The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs when features are highly correlated

For any $X \in \mathbb{R}^{n \times p}$ with normalized columns, if $X_i$ and $X_j$, $i \ne j$, satisfy
$$\langle X_i, X_j \rangle \ge 1 - \epsilon^2$$
then the smallest singular value satisfies $\sigma_p \le \epsilon$

Proof idea: Consider $\left\|X\left(\vec e_i - \vec e_j\right)\right\|_2$.
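
A sketch illustrating the claim with two nearly identical, hypothetical features: the inner product of the normalized columns is close to 1, the smallest singular value is tiny, and the least-squares coefficients get perturbed strongly along that direction.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.standard_normal(n)
x1 /= np.linalg.norm(x1)
x2 = x1 + 0.01 * rng.standard_normal(n)          # highly correlated second feature
x2 /= np.linalg.norm(x2)
X = np.column_stack([x1, x2])

print(np.inner(x1, x2))                          # close to 1
print(np.linalg.svd(X, compute_uv=False))        # the smallest singular value is tiny

beta_star = np.array([1.0, 1.0])
z = 0.01 * rng.standard_normal(n)
beta_ls, *_ = np.linalg.lstsq(X, X @ beta_star + z, rcond=None)
print(beta_ls)       # noise amplified by ~1/sigma_p; the estimate can drift far from [1, 1]
```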

  73. Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

  74. Motivation

Goal: avoid noise amplification due to multicollinearity

Problem: noise amplification blows up the coefficients

Solution: penalize large-norm solutions when fitting the model

Adding a penalty term that promotes a particular structure is called regularization

  75. Ridge regression

For a fixed regularization parameter $\lambda > 0$
$$\vec\beta_{\mathrm{ridge}} := \arg\min_{\vec\beta} \left\|\vec y - X\vec\beta\right\|_2^2 + \lambda \left\|\vec\beta\right\|_2^2$$

  76. Ridge regression

For a fixed regularization parameter $\lambda > 0$
$$\vec\beta_{\mathrm{ridge}} := \arg\min_{\vec\beta} \left\|\vec y - X\vec\beta\right\|_2^2 + \lambda \left\|\vec\beta\right\|_2^2 = \left(X^TX + \lambda I\right)^{-1}X^T\vec y$$
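
A minimal sketch comparing the closed-form ridge solution with plain least squares on a nearly collinear design (random data, with λ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, lam = 100, 2, 0.1
x1 = rng.standard_normal(n)
X = np.column_stack([x1, x1 + 0.01 * rng.standard_normal(n)])   # nearly collinear columns
beta_star = np.array([1.0, 1.0])
y = X @ beta_star + 0.1 * rng.standard_normal(n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(beta_ls)      # LS: the error along the small-singular-value direction can be large
print(beta_ridge)   # ridge: the penalty damps that direction, keeping the estimate stable
```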
