Probability and Statistics for Computer Science
"All models are wrong, but some models are useful" --- George Box (credit: Wikipedia)
Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 11.17.2020
Last time
* Stochastic Gradient Descent
* Naïve Bayes classifier
Today: regression
Some popular topics in Ngram
Objectives
* Linear regression: definition
* The least squares solution
* Training and prediction
* Evaluating the fit with R-squared
Regression models are machine learning methods
* Regression models have been around for a while
* Dr. Kevin Murphy's Machine Learning book has 3+ chapters on regression
The regression problem
* We have a dataset {(x, y)} of labeled items.
* In classification the label y is discrete; in regression y is a continuous value (e.g., y = 0.56).
* Given a new x, what is the predicted value y_p?
Chicago socioeconomic census
* The census included 77 communities in Chicago
* The census evaluated the average hardship index of the residents
* The census evaluated the following parameters for each community:
  PERCENT_OF_HOUSING_CROWDED, PERCENT_HOUSEHOLDS_BELOW_POVERTY, PERCENT_AGED_16p_UNEMPLOYED, PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA, PERCENT_AGED_UNDER_18_OR_OVER_64, PER_CAPITA_INCOME
* Given a new community and its parameters, can you predict its average hardship index?
Wait, have we seen linear regression before?
[figure: scatter plots illustrating correlation between x and y]
It's about relationships between data features
* Example: Is the height of people related to their weight?
* x: HEIGHT, y: WEIGHT
Some terminology
* Suppose the dataset consists of N labeled items {(x_i, y_i)}
* If we represent the dataset as a table:
  - The d columns representing the x^(j) are called explanatory variables
  - The numerical column y is called the dependent variable

  x^(1) | x^(2) | y
    1   |   3   | 0
    2   |   3   | 2
    3   |   6   | 5
Variables of the Chicago census
[1] "PERCENT_OF_HOUSING_CROWDED"
[2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY"
[3] "PERCENT_AGED_16p_UNEMPLOYED"
[4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
[5] "PERCENT_AGED_UNDER_18_OR_OVER_64"
[6] "PER_CAPITA_INCOME"
[7] "HardshipIndex"
Which is the dependent variable in the census example?
A. "PERCENT_OF_HOUSING_CROWDED"
B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
C. "HardshipIndex"
D. "PERCENT_AGED_UNDER_18_OR_OVER_64"
Linear model
* We begin by modeling y as a linear function of the x^(j) plus randomness:
  y = x^(1) β_1 + x^(2) β_2 + ... + x^(d) β_d + ξ
  where ξ is a zero-mean random variable that represents model error.
* In vector notation, with x^T = [x^(1), ..., x^(d)]:
  y = x^T β + ξ
  where β is the d-dimensional vector of coefficients that we train.
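Purely as illustration, here is a minimal NumPy sketch of drawing data from this model; the coefficients, noise scale, and dataset size are made-up values, not anything from the census data:

```python
import numpy as np

# A minimal sketch of sampling from the model y = x^T beta + xi.
rng = np.random.default_rng(0)
N, d = 100, 2
beta_true = np.array([2.0, -1.0])    # hypothetical coefficients
X = rng.normal(size=(N, d))          # N items, d explanatory variables
xi = rng.normal(scale=0.1, size=N)   # zero-mean model error
y = X @ beta_true + xi               # dependent variable
```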
Each data item gives an equation
* The model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ
* Plugging in the training data:
  0 = 1·β_1 + 3·β_2 + ξ_1
  2 = 2·β_1 + 3·β_2 + ξ_2
  5 = 3·β_1 + 6·β_2 + ξ_3

  Training data:
  x^(1) | x^(2) | y
    1   |   3   | 0
    2   |   3   | 2
    3   |   6   | 5
Which together form a matrix equation
* The model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ
* Stacking the training data:

  [0]   [1 3]  [β_1]   [ξ_1]
  [2] = [2 3]  [β_2] + [ξ_2]
  [5]   [3 6]          [ξ_3]

  i.e., y = X·β + e
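As a concrete sketch, the slide's training table stacked into y = X·β + e form in NumPy:

```python
import numpy as np

# The training table from the slide as arrays.
X = np.array([[1., 3.],
              [2., 3.],
              [3., 6.]])    # one row per data item, one column per x^(j)
y = np.array([0., 2., 5.])  # the dependent variable
```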
Q. What's the dimension of the matrix X?
A. N × d
B. d × N
C. N × N
D. d × d
Training the model is to choose β
* Given a training dataset {(x, y)}, we want to fit the model y = x^T β + ξ
* Define y = [y_1, ..., y_N]^T, X = [x_1^T; ...; x_N^T], and e = [ξ_1, ..., ξ_N]^T
* To train the model, we need to choose the β that makes e small in the matrix equation
  y = X·β + e
* Two approaches give the same answer: (1) least squares; (2) MLE (textbook loss function, p. 309)
Training using least squares
* In the least squares method, we aim to minimize the loss (cost) ‖e‖²:
  ‖e‖² = ‖y − Xβ‖² = (y − Xβ)^T (y − Xβ)
* Differentiating with respect to β and setting to zero:
  X^T X β − X^T y = 0
* If X^T X is invertible, the least squares estimate of the coefficients is:
  β̂ = (X^T X)^{−1} X^T y
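A small NumPy sketch of this estimate on the slide's training data; solving the normal equations directly is one reasonable choice, not the only one:

```python
import numpy as np

X = np.array([[1., 3.], [2., 3.], [3., 6.]])
y = np.array([0., 2., 5.])

# Solve the normal equations X^T X beta = X^T y; np.linalg.solve avoids
# forming the inverse explicitly, which is numerically preferable.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # [2.0, -0.3333...], i.e. beta = (2, -1/3)

# np.linalg.lstsq computes the same least squares estimate and also
# copes with a rank-deficient X:
beta_hat2, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```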
Dimensions: X is N × d and X^T is d × N, so X^T X is d × d — a real, symmetric matrix.
Derivation of the least squares solution
  ‖e‖² = (y − Xβ)^T (y − Xβ) = y^T y − 2 β^T X^T y + β^T X^T X β
Useful vector/matrix derivatives (A a square matrix; a, b vectors):
  ∂(a^T A a)/∂a = (A + A^T) a,   ∂(b^T a)/∂a = b
Since X^T X is symmetric, (X^T X) + (X^T X)^T = 2 X^T X, so
  ∂(β^T X^T X β)/∂β = 2 X^T X β,   ∂(β^T X^T y)/∂β = X^T y
(Note all terms in ‖e‖² are scalars.) Setting the derivative to zero:
  ∂‖e‖²/∂β = −2 X^T y + 2 X^T X β = 0
  ⇒ X^T X β = X^T y
  ⇒ β̂ = (X^T X)^{−1} X^T y   (β̂ is a vector here)
Derivation of the least squares solution (continued)
  X^T y − X^T X β̂ = 0   (X^T is d × N, β̂ is d × 1)
  ⇒ X^T (y − X β̂) = 0
  ⇒ X^T e = 0   (a d × 1 zero vector), equivalently e^T X = 0
  ⇒ e^T X β̂ = 0
So the residual e is orthogonal to (uncorrelated with) the fitted values X β̂.
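A quick numeric check of this orthogonality on the same example data:

```python
import numpy as np

X = np.array([[1., 3.], [2., 3.], [3., 6.]])
y = np.array([0., 2., 5.])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ beta_hat        # residual vector, here (-1, -1, 1)
print(X.T @ e)              # ~[0, 0]: e is orthogonal to every column of X
print(e @ (X @ beta_hat))   # ~0: e^T X beta_hat = 0
```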
Loss function (least squares)
  ‖e‖² = Σ_{j=1}^N Q_j(β),  where Q_j(β) = (x_j^T β − y_j)²
Each Q_j is the squared error the model makes on item j.
Convex set and convex function
* If a set is convex, any line connecting two points in the set is completely included in the set.
* A convex function: the area above the curve is a convex set,
  f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y)
* The least squares loss function is convex.
Credit: Dr. Kevin Murphy
What's the dimension of the matrix X^T X?
A. N × d
B. d × N
C. N × N
D. d × d
(X is N × d and X^T is d × N, so X^T X is d × d, where d is the number of explanatory variables, i.e. features.)
Is this statement true? If the matrix X^T X does NOT have zero-valued eigenvalues, it is invertible.
A. TRUE
B. FALSE
(TRUE: det(X^T X) = Π_i λ_i ≠ 0 when all λ_i ≠ 0, so the matrix is invertible.)
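A short sketch checking this on the example's X^T X, using the fact that the determinant equals the product of the eigenvalues:

```python
import numpy as np

X = np.array([[1., 3.], [2., 3.], [3., 6.]])
A = X.T @ X                   # d x d, real and symmetric

lam = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
print(lam)                    # both positive here, so A is invertible
print(np.isclose(lam.prod(), np.linalg.det(A)))  # True: det = product of eigenvalues
```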
Training using least squares: example
* Model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ
* β̂ = (X^T X)^{−1} X^T y = [2, −1/3]^T, i.e. β_1 = 2 and β_2 = −1/3

  Training data:
  x^(1) | x^(2) | y
    1   |   3   | 0
    2   |   3   | 2
    3   |   6   | 5
Prediction
* Once we have trained the model coefficients β̂, we can predict y_p0 from a new input x_0:
  y_p0 = x_0^T β̂
* In the model y = x^(1) β_1 + x^(2) β_2 + ξ with β̂ = [2, −1/3]^T:
* The prediction for x_0 = (3, 6) is y_p = 3·2 + 6·(−1/3) = 4
* The prediction for x_0 = (0, 0) is y_p = 0
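The same predictions as a one-line matrix-vector product in NumPy:

```python
import numpy as np

X = np.array([[1., 3.], [2., 3.], [3., 6.]])
y = np.array([0., 2., 5.])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (2, -1/3)

print(np.array([3., 6.]) @ beta_hat)   # 4.0
print(np.array([0., 0.]) @ beta_hat)   # 0.0 -- this model always predicts 0 at x = 0
```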
A linear model with constant offset
* The problem with the model y = x^(1) β_1 + x^(2) β_2 + ξ is: when x = 0, the prediction is forced to be y = 0.
* Let's add a constant offset β_0 to the model:
  y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ
Training and prediction with constant offset
* The model: y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ = x^T β + ξ, where x^T = [1, x^(1), x^(2)]
* Training data (with a column of 1s prepended):

  1 | x^(1) | x^(2) | y
  1 |   1   |   3   | 0
  1 |   2   |   3   | 2
  1 |   3   |   6   | 5

  β̂ = (X^T X)^{−1} X^T y = [−3, 2, 1/3]^T
* For x_0 = (0, 0): y_p0 = 1·(−3) + 0·2 + 0·(1/3) = −3
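A sketch of the same fit in NumPy; prepending a ones column is the standard way to fold the offset into the matrix form:

```python
import numpy as np

X = np.array([[1., 3.], [2., 3.], [3., 6.]])
y = np.array([0., 2., 5.])

# Prepend a column of ones so that beta_0 acts as the constant offset.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_hat)                            # [-3.0, 2.0, 0.3333...] = (-3, 2, 1/3)
print(np.array([1., 0., 0.]) @ beta_hat)   # -3.0, the prediction at x = (0, 0)
```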
Comparing our example models
Without offset: y = x^(1) β_1 + x^(2) β_2 + ξ, with β̂ = [2, −1/3]^T:

  x^(1) | x^(2) | y | x^T β̂
    1   |   3   | 0 |   1
    2   |   3   | 2 |   3
    3   |   6   | 5 |   4

With offset: y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ, with β̂ = [−3, 2, 1/3]^T:

  1 | x^(1) | x^(2) | y | x^T β̂
  1 |   1   |   3   | 0 |   0
  1 |   2   |   3   | 2 |   2
  1 |   3   |   6   | 5 |   5

The offset model fits this training data exactly.
Variance of the linear regression model
* The least squares estimate satisfies this property:
  var({y_i}) = var({x_i^T β̂}) + var({ξ_i})
* The random error is uncorrelated with the least squares fit (the linear combination of explanatory variables): y = X β̂ + e, with β̂ = (X^T X)^{−1} X^T y.
Variance of the linear regression model: proof
* The least squares estimate satisfies: var({y_i}) = var({x_i^T β̂}) + var({ξ_i})
Proof sketch: since y = X β̂ + e,
  var(y) = var(X β̂) + var(e) + 2 cov(X β̂, e)
and cov(X β̂, e) = 0.
Variance of the linear regression model: proof
var[y] = (1/N) ([X β̂ − mean(X β̂)] + [e − ē])^T ([X β̂ − mean(X β̂)] + [e − ē])
       = (1/N) ([X β̂ − mean(X β̂)]^T [X β̂ − mean(X β̂)] + 2 [e − ē]^T [X β̂ − mean(X β̂)] + [e − ē]^T [e − ē])
Because e^T X β̂ = 0 (from least squares minimization) and e^T 1 = 0, so ē = 0, the cross term vanishes:
var[y] = (1/N) ([X β̂ − mean(X β̂)]^T [X β̂ − mean(X β̂)] + [e − ē]^T [e − ē]) = var[X β̂] + var[e]
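A numeric sanity check of this decomposition on synthetic data (made-up coefficients and noise scale). Note the intercept column: it is what makes e^T 1 = 0, and hence ē = 0, in the proof above:

```python
import numpy as np

# Check var(y) = var(X beta_hat) + var(e) holds exactly for a least squares fit.
rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])        # intercept + 2 features
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
print(np.isclose(np.var(y), np.var(X @ beta_hat) + np.var(e)))    # True
```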