

  1. Probability and Statistics for Computer Science
      “All models are wrong, but some models are useful” --- George Box (Credit: Wikipedia)
      Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 11.17.2020

  2. Last time
      - Stochastic Gradient Descent
      - Naïve Bayesian Classifier (a classification method); today we turn to regression

  3. Some popular topics in Ngram

  4. Objectives
      - Linear regression: definition
      - Training: the least squares solution
      - Prediction
      - Evaluating the fit: R-squared

  5. Regression models are machine learning methods
      - Regression models have been around for a while
      - Dr. Kevin Murphy's Machine Learning book has 3+ chapters on regression

  6. The regression problem
      - As in classification, the data are pairs {(x, y)}
      - In classification, y is a discrete class label; in regression, y is a real-valued number
      - Given a new x, what value y^p should we predict for it?

  7. Chicago social economic census
      - The census included 77 communities in Chicago
      - The census evaluated the average hardship index of the residents
      - The census evaluated the following parameters for each community:
        PERCENT_OF_HOUSING_CROWDED, PERCENT_HOUSEHOLDS_BELOW_POVERTY, PERCENT_AGED_16p_UNEMPLOYED,
        PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA, PERCENT_AGED_UNDER_18_OR_OVER_64, PER_CAPITA_INCOME
      - Given a new community and its parameters, can you predict its average hardship index from all these parameters?
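A minimal sketch of how this prediction task could be set up in Python, assuming the census table is available as a CSV file. The file name chicago_census.csv is hypothetical; the column names are the ones listed on the variables slide, and scikit-learn's ordinary least squares regressor is used in place of the formulas derived later.

```python
# Hypothetical sketch: fit a linear model for the hardship index.
# Assumes a CSV file "chicago_census.csv" with the column names below.
import pandas as pd
from sklearn.linear_model import LinearRegression

features = [
    "PERCENT_OF_HOUSING_CROWDED",
    "PERCENT_HOUSEHOLDS_BELOW_POVERTY",
    "PERCENT_AGED_16p_UNEMPLOYED",
    "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA",
    "PERCENT_AGED_UNDER_18_OR_OVER_64",
    "PER_CAPITA_INCOME",
]
df = pd.read_csv("chicago_census.csv")      # 77 communities, one row each
X = df[features].to_numpy()                 # explanatory variables
y = df["HardshipIndex"].to_numpy()          # dependent variable

model = LinearRegression().fit(X, y)        # least squares fit
new_community = X[:1]                       # stand-in for a new community's parameters
print(model.predict(new_community))         # predicted hardship index
```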

  8. Wait, have we seen the linear regression before?
      - Recall correlation and the scatter plot of y against x

  9. It's about relationships between data features
      - Example: Is the height of people related to their weight? (x: HEIGHT, y: WEIGHT)

  10. Some terminology
      - Suppose the dataset {(x, y)} consists of N labeled items (x_i, y_i)
      - If we represent the dataset as a table:
        - the d columns x^(j) representing the items are called explanatory variables
        - the numerical column y is called the dependent variable
      - Example table:
          x^(1)  x^(2)  y
            1      3    0
            2      3    2
            3      6    5
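As a concrete sketch (assuming numpy), the example table can be stored as an N × d array of explanatory variables and a length-N vector for the dependent variable:

```python
import numpy as np

# The example table: d = 2 explanatory variables, N = 3 items.
X = np.array([[1, 3],
              [2, 3],
              [3, 6]], dtype=float)   # columns are x^(1), x^(2)
y = np.array([0, 2, 5], dtype=float)  # dependent variable

N, d = X.shape
print(N, d)   # 3 2
```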

  11. Variables of the Chicago census
      [1] "PERCENT_OF_HOUSING_CROWDED"
      [2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY"
      [3] "PERCENT_AGED_16p_UNEMPLOYED"
      [4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
      [5] "PERCENT_AGED_UNDER_18_OR_OVER_64"
      [6] "PER_CAPITA_INCOME"
      [7] "HardshipIndex"

  12. Which is the dependent variable in the census example?
      A. "PERCENT_OF_HOUSING_CROWDED"
      B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
      C. "HardshipIndex"   (correct answer)
      D. "PERCENT_AGED_UNDER_18_OR_OVER_64"

  13. Linear model
      - We begin by modeling y as a linear function of the x^(j) plus randomness:
          y = x^(1) β_1 + x^(2) β_2 + ... + x^(d) β_d + ξ
        where ξ is a zero-mean random variable that represents model error
      - In vector notation, with x^T = [x^(1) ... x^(d)]:
          y = x^T β + ξ
        where β is the d-dimensional vector of coefficients that we train
      - Example table:
          x^(1)  x^(2)  y
            1      3    0
            2      3    2
            3      6    5
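A small sketch of what this model says generatively. The particular β values and the Gaussian choice for ξ are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 100
beta = np.array([2.0, -0.5])            # illustrative "true" coefficients
X = rng.uniform(0, 5, size=(N, d))      # explanatory variables
xi = rng.normal(0.0, 1.0, size=N)       # zero-mean model error ξ
y = X @ beta + xi                       # y = x^T β + ξ for every item
```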

  14. Each data item gives an equation
      - The model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ
      - Training data:
          x^(1)  x^(2)  y
            1      3    0
            2      3    2
            3      6    5
      - Each row gives one equation:
          0 = 1·β_1 + 3·β_2 + ξ_1
          2 = 2·β_1 + 3·β_2 + ξ_2
          5 = 3·β_1 + 6·β_2 + ξ_3

  15. Which together form a matrix equation
      - The model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ, with E[ξ] = 0
      - Training data:
          x^(1)  x^(2)  y
            1      3    0
            2      3    2
            3      6    5
      - Stacking the three per-item equations:
          [0]   [1  3]          [ξ_1]
          [2] = [2  3] [β_1]  + [ξ_2]
          [5]   [3  6] [β_2]    [ξ_3]

  16. Which together form a matrix equation (continued)
      - The model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ
      - The stacked training equations can be written compactly as
          y = X · β + e
        with y = [0, 2, 5]^T, X = [[1, 3], [2, 3], [3, 6]], β = [β_1, β_2]^T and e = [ξ_1, ξ_2, ξ_3]^T

  17. Q. What's the dimension of matrix X?
      A. N × d   (correct answer)
      B. d × N
      C. N × N
      D. d × d

  18. Training the model is to choose β
      - Given a training dataset {(x, y)}, we want to fit the model y = x^T β + ξ
      - Define y = [y_1, ..., y_N]^T, X = [x_1^T; ...; x_N^T] and e = [ξ_1, ..., ξ_N]^T
      - To train the model, we need to choose β that makes e small in the matrix equation y = X · β + e
      - Two approaches: ① least squares, ② MLE (see the loss function in the textbook, p. 309)

  19. Training using least squares
      - In the least squares method, we aim to minimize the loss (cost)
          ||e||^2 = ||y - Xβ||^2 = (y - Xβ)^T (y - Xβ)
      - Differentiating with respect to β and setting the derivative to zero gives
          X^T X β - X^T y = 0
      - If X^T X is invertible, the least squares estimate of the coefficients is
          β̂ = (X^T X)^(-1) X^T y
        (equivalently, β̂ = argmin_β ||y - Xβ||^2)
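A minimal numpy sketch of this formula, applied to the small training table used throughout the slides; np.linalg.lstsq is used only as an independent check:

```python
import numpy as np

X = np.array([[1, 3], [2, 3], [3, 6]], dtype=float)
y = np.array([0, 2, 5], dtype=float)

# Least squares estimate via the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                            # approx [2., -0.3333], i.e. (2, -1/3)

# Cross-check with numpy's built-in least squares solver.
beta_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_check))   # True
```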

  20. (Whiteboard) Dimensions of X^T X: X is N × d and X^T is d × N, so X^T X is d × d; it is a real symmetric matrix, and its eigenvalues satisfy λ_i ≥ 0.

  21. Derivation of the least squares solution
      - Expand the loss:
          ||e||^2 = (y - Xβ)^T (y - Xβ) = y^T y - β^T X^T y - y^T X β + β^T X^T X β
      - Useful vector/matrix derivatives (A a square matrix, a and b vectors):
          ∂(a^T A a)/∂a = (A + A^T) a,   ∂(b^T a)/∂a = b
      - Since y^T X β is a scalar, y^T X β = (y^T X β)^T = β^T X^T y, so all terms in ||e||^2 are scalars and
          ||e||^2 = y^T y - 2 β^T X^T y + β^T X^T X β
      - Therefore ∂(β^T X^T y)/∂β = X^T y and, because X^T X is symmetric,
          ∂(β^T X^T X β)/∂β = (X^T X + (X^T X)^T) β = 2 X^T X β
      - Setting ∂||e||^2/∂β = -2 X^T y + 2 X^T X β = 0 gives X^T X β = X^T y, so
          β̂ = (X^T X)^(-1) X^T y
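A small numerical check of the gradient used in this derivation, ∂||y - Xβ||^2/∂β = 2(X^T X β - X^T y), against finite differences. The data and the test point β are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
beta = rng.normal(size=3)

def loss(b):
    r = y - X @ b
    return r @ r                          # ||y - X b||^2

grad_analytic = 2 * (X.T @ X @ beta - X.T @ y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([
    (loss(beta + eps * np.eye(3)[k]) - loss(beta - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```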

  22. Derivation of the least squares solution (continued)
      - From X^T y - X^T X β̂ = 0 we get X^T (y - X β̂) = 0, i.e. X^T e = 0 (d × 1) and e^T X = 0 (1 × d)
      - Hence e^T X β̂ = 0: the residual e is orthogonal to (uncorrelated with) the fitted values X β̂
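A quick numpy check of this orthogonality on the example data, up to floating-point error:

```python
import numpy as np

X = np.array([[1, 3], [2, 3], [3, 6]], dtype=float)
y = np.array([0, 2, 5], dtype=float)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                  # residual vector

print(X.T @ e)                        # approx [0, 0]: X^T e = 0
print(e @ (X @ beta_hat))             # approx 0: e is orthogonal to the fit X beta_hat
```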

  23. Loss function (least squares)
      - The least squares loss decomposes over the data items:
          ||e||^2 = Loss(β) = Σ_{j=1}^{N} Q_j(β),  where Q_j(β) = (x_j^T β - y_j)^2
      - A per-item loss Q_j of this kind also appears in the final project

  24. Convex set and convex function
      - If a set is convex, any line segment connecting two points in the set is completely included in the set
      - A convex function: the area above the curve is a convex set, i.e. for λ ∈ [0, 1]
          f(λx + (1-λ)y) ≤ λ f(x) + (1-λ) f(y)
      - The least squares loss function is convex
      Credit: Dr. Kevin Murphy
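A small illustrative check of the convexity inequality for the least squares loss; the data, the two β points and the mixing weight λ are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

def loss(b):
    r = y - X @ b
    return r @ r                      # least squares loss ||y - X b||^2

b1, b2 = rng.normal(size=3), rng.normal(size=3)
lam = 0.3
lhs = loss(lam * b1 + (1 - lam) * b2)
rhs = lam * loss(b1) + (1 - lam) * loss(b2)
# True: the loss at a convex combination is at most the combination of the losses.
print(lhs <= rhs + 1e-12)
```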

  25. What's the dimension of matrix X^T X?
      A. N × d
      B. d × N
      C. N × N
      D. d × d   (correct answer; d = the number of explanatory variables, i.e. features)

  26. Is this statement true? If the matrix X^T X does NOT have zero-valued eigenvalues, it is invertible.
      A. TRUE   (correct answer)
      B. FALSE
      (Board note: X^T X is real symmetric with eigenvalues λ_i ≥ 0, and det(X^T X) = Π_i λ_i, so it is invertible exactly when all λ_i > 0.)
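A brief numpy illustration of this criterion on the example data; eigvalsh is appropriate here because X^T X is symmetric:

```python
import numpy as np

X = np.array([[1, 3], [2, 3], [3, 6]], dtype=float)
A = X.T @ X                           # d x d, real symmetric, eigenvalues >= 0

eigenvalues = np.linalg.eigvalsh(A)
print(eigenvalues)                    # both strictly positive for this X
print(np.all(eigenvalues > 0))        # True, so X^T X is invertible
```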

  27. Training using least squares: example
      - Model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ
      - Training data:
          x^(1)  x^(2)  y
            1      3    0
            2      3    2
            3      6    5
      - β̂ = (X^T X)^(-1) X^T y = [2, -1/3]^T, i.e. β̂_1 = 2, β̂_2 = -1/3

  28. Prediction
      - Once we have trained the model coefficients β̂, we can predict y^p_0 for a new input x_0:
          y^p_0 = x_0^T β̂
      - In the model y = x^(1) β_1 + x^(2) β_2 + ξ with β̂ = [2, -1/3]^T:
        for example, for x_0 = [0, 0]^T the prediction is y^p_0 = 0
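A tiny sketch of prediction with the fitted coefficients from the example. Note how x_0 = 0 can only ever give a prediction of 0 in this model, which motivates the next slide:

```python
import numpy as np

X = np.array([[1, 3], [2, 3], [3, 6]], dtype=float)
y = np.array([0, 2, 5], dtype=float)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # approx (2, -1/3)

x0 = np.array([0.0, 0.0])
print(x0 @ beta_hat)                           # 0.0: without an offset, x = 0 always predicts y = 0
```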

  29. A linear model with constant offset
      - The problem with the model y = x^(1) β_1 + x^(2) β_2 + ξ is: when x = 0, the prediction is forced to be y = 0
      - Let's add a constant offset β_0 to the model:
          y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ
      - Equivalently, prepend a constant 1 to each item: x^T = [1, x^(1), x^(2)]

  30. Training and prediction with constant offset
      - The model: y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ = x^T β + ξ, with x^T = [1, x^(1), x^(2)]
      - Training data (with the constant column added):
          1  x^(1)  x^(2)  y
          1    1      3    0
          1    2      3    2
          1    3      6    5
      - β̂ = (X^T X)^(-1) X^T y = [-3, 2, 1/3]^T
      - For x_0 = [1, 0, 0]^T, the prediction is y^p_0 = x_0^T β̂ = -3
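A numpy sketch of the same fit with a constant column of ones prepended to X, reproducing the coefficients shown above:

```python
import numpy as np

X = np.array([[1, 3], [2, 3], [3, 6]], dtype=float)
y = np.array([0, 2, 5], dtype=float)

X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend the constant column
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_hat)                                   # approx [-3., 2., 0.3333]

x0 = np.array([1.0, 0.0, 0.0])                    # new point, offset column included
print(x0 @ beta_hat)                              # approx -3.0
```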

  31. Comparing our example models
      - With constant offset: y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ, β̂ = [-3, 2, 1/3]^T
          1  x^(1)  x^(2)  y   x^T β̂
          1    1      3    0     0
          1    2      3    2     2
          1    3      6    5     5
      - Without offset: y = x^(1) β_1 + x^(2) β_2 + ξ, β̂ = [2, -1/3]^T
          x^(1)  x^(2)  y   x^T β̂
            1      3    0     1
            2      3    2     3
            3      6    5     4
      - The model with the constant offset fits this training data exactly

  32. Variance of the linear regression model
      - The least squares estimate satisfies this property:
          var({y_i}) = var({x_i^T β̂}) + var({ξ_i})
      - The random error is uncorrelated with the least squares fit, which is a linear combination of the explanatory variables (recall X^T e = 0, so e ⟂ X β̂, where β̂ = (X^T X)^(-1) X^T y)

  33. Variance of the linear regression model: proof idea
      - The least squares estimate satisfies var({y_i}) = var({x_i^T β̂}) + var({ξ_i})
      - Proof sketch: since y = X β̂ + e,
          var(y) = var(X β̂) + var(e) + 2 cov(X β̂, e)
        and cov(X β̂, e) = 0

  34. Variance of the linear regression model: proof
      - The least squares estimate satisfies var({y_i}) = var({x_i^T β̂}) + var({ξ_i})
      - Proof: using y = X β̂ + e,
          var[y] = (1/N) ([X β̂ - mean(X β̂)] + [e - ē])^T ([X β̂ - mean(X β̂)] + [e - ē])
                 = (1/N) ([X β̂ - mean(X β̂)]^T [X β̂ - mean(X β̂)] + 2 [e - ē]^T [X β̂ - mean(X β̂)] + [e - ē]^T [e - ē])
      - Because e^T X β̂ = 0 (the least squares solution makes the residual orthogonal to X) and ē = 0 (e^T 1 = 0 when the model contains a constant offset), the cross term vanishes:
          var[y] = (1/N) ([X β̂ - mean(X β̂)]^T [X β̂ - mean(X β̂)] + [e - ē]^T [e - ē]) = var[X β̂] + var[e]
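A numerical sanity check of this decomposition on simulated data. The data-generating choices are illustrative assumptions; a constant column is included so that ē = 0 holds:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])   # constant offset column + 2 features
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 1.0, size=N)            # y = X beta + xi

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat
e = y - fitted

print(np.var(y))                          # var(y)
print(np.var(fitted) + np.var(e))         # var(X beta_hat) + var(e): matches var(y)
```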
