CS6220: Data Mining Techniques
Matrix Data: Prediction
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
January 19, 2016
Announcements
• Team formation due next Wednesday
• Homework 1 out by tomorrow
Today’s Schedule
• Course Project Introduction
• Linear Regression Model
• Decision Tree
Methods to Learn (by task and data type)
• Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network); Neural Network (images)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data); PLSA (text data); SCAN; Spectral Clustering (graph & network)
• Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
• Prediction: Linear Regression; Collaborative Filtering (matrix data); Autoregression (time series)
• Similarity Search: DTW (time series); P-PageRank (graph & network)
• Ranking: PageRank (graph & network)
How to Learn These Algorithms?
• Three levels:
  • When is it applicable?
    • Input, output, strengths, weaknesses, time complexity
  • How does it work?
    • Pseudo-code, workflows, major steps
    • Can work out a toy problem by pen and paper
  • Why does it work?
    • Intuition, philosophy, objective, derivation, proof
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary
Example
• An $n \times p$ matrix:

$$\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$

• n data objects / points (rows)
• p attributes / dimensions (columns)
Attribute Type
• Numerical
  • E.g., height, income
• Categorical / discrete
  • E.g., sex, race
Categorical Attribute Types
• Nominal: categories, states, or “names of things”
  • E.g., Hair_color = {auburn, black, blond, brown, grey, red, white}
  • Marital status, occupation, ID numbers, zip codes
• Binary: nominal attribute with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important
    • E.g., gender
  • Asymmetric binary: outcomes not equally important
    • E.g., medical test (positive vs. negative)
    • Convention: assign 1 to the more important outcome (e.g., HIV positive)
• Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
  • E.g., Size = {small, medium, large}, grades, army rankings
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary
Linear Regression
• Ordinary Least Squares (OLS) Regression
  • Closed-form solution
  • Gradient descent
• Linear Regression with Probabilistic Interpretation
The Linear Regression Problem
• Map any attributes to a continuous value: $\mathbf{x} \Rightarrow y$
  • {age; major; gender; race} ⇒ GPA
  • {income; credit score; profession} ⇒ loan
  • {college; major; GPA} ⇒ future income
  • ...
Illustration (figure omitted)
Formalization
• Data: n independent data objects
  • $y_i$, $i = 1, \ldots, n$
  • $\mathbf{x}_i = (x_{i0}, x_{i1}, x_{i2}, \ldots, x_{ip})^T$, $i = 1, \ldots, n$
  • A constant factor is added to model the bias term, i.e., $x_{i0} = 1$
• Model:
  • $y$: dependent variable
  • $\mathbf{x}$: explanatory variables
  • $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$: weight vector
  • $y = \mathbf{x}^T \boldsymbol{\beta} = \beta_0 + x_1\beta_1 + x_2\beta_2 + \cdots + x_p\beta_p$
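A minimal NumPy sketch of this setup with made-up toy values; the only step the slide prescribes is prepending the constant column $x_{i0} = 1$:

```python
import numpy as np

# Toy data: n = 4 objects, p = 2 attributes (values are illustrative).
X_raw = np.array([[2.0, 3.0],
                  [1.0, 0.5],
                  [4.0, 1.5],
                  [3.0, 2.0]])

# Prepend a column of ones so that beta_0 acts as the bias term (x_i0 = 1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # shape: n x (p + 1)
```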
A 2-Step Process
• Model construction
  • Use training data to find the best parameter $\boldsymbol{\beta}$, denoted as $\hat{\boldsymbol{\beta}}$
• Model usage
  • Model evaluation
    • Use validation data to select the best model
    • Feature selection
  • Apply the model to unseen data (test data): $y = \mathbf{x}^T \hat{\boldsymbol{\beta}}$
Least Squares Estimation
• Cost function (total squared error):
  • $J(\boldsymbol{\beta}) = \sum_i (\mathbf{x}_i^T \boldsymbol{\beta} - y_i)^2$
• Matrix form:
  • $J(\boldsymbol{\beta}) = (X\boldsymbol{\beta} - \mathbf{y})^T (X\boldsymbol{\beta} - \mathbf{y})$, or equivalently $\|X\boldsymbol{\beta} - \mathbf{y}\|^2$

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}$$

  • $X$: an $n \times (p+1)$ matrix
  • $\mathbf{y}$: an $n \times 1$ vector
Ordinary Least Squares (OLS)
• Goal: find $\boldsymbol{\beta}$ that minimizes $J(\boldsymbol{\beta})$
  • $J(\boldsymbol{\beta}) = (X\boldsymbol{\beta} - \mathbf{y})^T (X\boldsymbol{\beta} - \mathbf{y}) = \boldsymbol{\beta}^T X^T X \boldsymbol{\beta} - \mathbf{y}^T X \boldsymbol{\beta} - \boldsymbol{\beta}^T X^T \mathbf{y} + \mathbf{y}^T \mathbf{y}$
• Ordinary least squares: set the first derivative of $J(\boldsymbol{\beta})$ to 0
  • $\frac{\partial J}{\partial \boldsymbol{\beta}} = 2\boldsymbol{\beta}^T X^T X - 2\mathbf{y}^T X = 0$
  • $\Rightarrow \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}$
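A sketch of the closed-form solution in NumPy (the data is illustrative; in practice `np.linalg.lstsq` is preferred over forming an explicit inverse):

```python
import numpy as np

# Illustrative design matrix (first column is the bias) and targets.
X = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [1.0, 4.0],
              [1.0, 3.0]])
y = np.array([5.1, 3.0, 8.9, 7.2])

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalent: a least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```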
Gradient Descent
• Minimize the cost function by moving in the steepest descent direction
Batch Gradient Descent
• Move in the direction of steepest descent:

Repeat until convergence {
  $\boldsymbol{\beta}^{(t+1)} := \boldsymbol{\beta}^{(t)} - \eta \left.\frac{\partial J}{\partial \boldsymbol{\beta}}\right|_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t)}}$, e.g., $\eta = 0.1$
}

where $J(\boldsymbol{\beta}) = \sum_i J_i(\boldsymbol{\beta}) = \sum_i (\mathbf{x}_i^T \boldsymbol{\beta} - y_i)^2$ and $\frac{\partial J}{\partial \boldsymbol{\beta}} = \sum_i \frac{\partial J_i}{\partial \boldsymbol{\beta}} = \sum_i 2\mathbf{x}_i (\mathbf{x}_i^T \boldsymbol{\beta} - y_i)$
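A minimal sketch of this loop, assuming a fixed step size `eta` and a fixed iteration budget rather than a true convergence test:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Minimize J(beta) = sum_i (x_i^T beta - y_i)^2 with full-batch steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)  # dJ/dbeta = sum_i 2 x_i (x_i^T beta - y_i)
        beta = beta - eta * grad
    return beta
```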
Stochastic Gradient Descent
• When a new observation i comes in, update the weights immediately (extremely useful for large-scale datasets):

Repeat {
  for i = 1 : n {
    $\boldsymbol{\beta}^{(t+1)} := \boldsymbol{\beta}^{(t)} + 2\eta (y_i - \mathbf{x}_i^T \boldsymbol{\beta}^{(t)}) \mathbf{x}_i$
  }
}

• If the prediction for object i is smaller than the real value, $\boldsymbol{\beta}$ should move toward the direction of $\mathbf{x}_i$
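The same update with per-observation steps; the shuffling and epoch count below are implementation choices, not part of the slide's pseudo-code:

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=10, seed=0):
    """Update beta after every single observation instead of a full pass."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # shuffling typically aids convergence
            beta = beta + 2 * eta * (y[i] - X[i] @ beta) * X[i]
    return beta
```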
Other Practical Issues
• What if $X^T X$ is not invertible?
  • Add a small portion of the identity matrix, $\lambda I$, to it (ridge regression*)
• What if some attributes are categorical?
  • Use dummy variables
  • E.g., $x = 1$ if sex = F; $x = 0$ if sex = M
  • Nominal variable with multiple values? Create more dummy variables for one variable
• What if a non-linear correlation exists?
  • Transform features, say, $x$ to $x^2$
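A sketch of the ridge fix, with the regularization strength `lam` left as an assumed user choice:

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Closed-form ridge: beta = (X^T X + lam * I)^{-1} X^T y.
    Adding lam * I guarantees invertibility and shrinks the weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```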
Probabilistic Interpretation
• Review of the normal distribution:
  • $X \sim N(\mu, \sigma^2) \Rightarrow f(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Probabilistic Interpretation
• Model: $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i$
  • $\varepsilon_i \sim N(0, \sigma^2)$
  • $y_i \mid \mathbf{x}_i, \boldsymbol{\beta} \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$
  • $E[y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T \boldsymbol{\beta}$
• Likelihood:
  • $L(\boldsymbol{\beta}) = \prod_i p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right\}$
• Maximum likelihood estimation: find $\boldsymbol{\beta}$ that maximizes $L(\boldsymbol{\beta})$
  • $\arg\max L = \arg\min J$: equivalent to OLS!
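Spelling out the equivalence: taking the log of $L(\boldsymbol{\beta})$ turns the product into a sum, and only the squared-error term depends on $\boldsymbol{\beta}$:

$$\log L(\boldsymbol{\beta}) = n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_i (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 = \text{const} - \frac{1}{2\sigma^2} J(\boldsymbol{\beta})$$

so maximizing $L(\boldsymbol{\beta})$ is exactly minimizing $J(\boldsymbol{\beta})$.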
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary
Model Selection Problem
• Basic problem: how to choose between competing linear regression models?
• Model too simple: “underfits” the data; poor predictions; high bias; low variance
• Model too complex: “overfits” the data; poor predictions; low bias; high variance
• Model just right: balances bias and variance to get good predictions
Bias and Variance
• True predictor $f(x) = \mathbf{x}^T \boldsymbol{\beta}$; estimated predictor $\hat{f}(x) = \mathbf{x}^T \hat{\boldsymbol{\beta}}$
• Bias: $E[\hat{f}(x)] - f(x)$
  • How far away is the expectation of the estimator from the true value? The smaller the better.
• Variance: $Var(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$
  • How variable is the estimator? The smaller the better.
• Reconsider the mean squared error: $J(\hat{\boldsymbol{\beta}})/n = \sum_i (\mathbf{x}_i^T \hat{\boldsymbol{\beta}} - y_i)^2 / n$
  • Can be considered as $E[(\hat{f}(x) - f(x) - \varepsilon)^2] = \text{bias}^2 + \text{variance} + \text{noise}$
  • Note: $E[\varepsilon] = 0$, $Var(\varepsilon) = \sigma^2$
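Why the decomposition holds: expand the square and use $E[\varepsilon] = 0$ together with the independence of the noise $\varepsilon$ from the estimator $\hat{f}$, so the cross terms vanish:

$$E[(\hat{f} - f - \varepsilon)^2] = \underbrace{(E[\hat{f}] - f)^2}_{\text{bias}^2} + \underbrace{E[(\hat{f} - E[\hat{f}])^2]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$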
Bias-Variance Trade-off (figure omitted)
Cross-Validation
• Partition the data into K folds
  • Use K−1 folds for training and 1 fold for testing
  • Calculate the average accuracy based on the K training-testing pairs
• Accuracy on a validation/test dataset!
  • Mean squared error can again be used: $\sum_i (\mathbf{x}_i^T \hat{\boldsymbol{\beta}} - y_i)^2 / n$
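A sketch of K-fold evaluation, assuming a caller-supplied `fit` function (e.g., the OLS or ridge sketches above) and MSE as the accuracy measure:

```python
import numpy as np

def k_fold_mse(X, y, fit, K=5, seed=0):
    """Average test MSE over K folds; fit(X, y) must return a weight vector."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = fit(X[train], y[train])
        errors.append(np.mean((X[test] @ beta - y[test]) ** 2))
    return np.mean(errors)
```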
AIC & BIC*
• AIC and BIC can be used to assess the quality of statistical models
• AIC (Akaike Information Criterion)
  • $AIC = 2k - 2\ln(\hat{L})$
  • where k is the number of parameters in the model and $\hat{L}$ is the likelihood under the estimated parameters
• BIC (Bayesian Information Criterion)
  • $BIC = k\ln(n) - 2\ln(\hat{L})$
  • where n is the number of objects
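A sketch of both criteria for the Gaussian linear model of the previous slides; the log-likelihood line follows from plugging $\hat{\sigma}^2 = RSS/n$ back into the likelihood, and k is supplied by the caller:

```python
import numpy as np

def aic_bic(X, y, beta, k):
    """AIC and BIC for a Gaussian linear model with k free parameters.
    Uses the maximized log-likelihood, with sigma^2 estimated as RSS / n."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik
```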
Stepwise Feature Selection
• Avoid brute-force selection over all $2^p$ feature subsets
• Forward selection
  • Start with the best single feature
  • Always add the feature that improves the performance most
  • Stop if no feature further improves the performance
• Backward elimination
  • Start with the full model
  • Always remove the feature whose removal improves the performance most
  • Stop if removing any feature makes the performance worse
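A sketch of forward selection, assuming a caller-supplied `score` callback (e.g., cross-validated MSE via the `k_fold_mse` sketch above; lower is better) over a set of column indices:

```python
def forward_selection(n_features, score):
    """Greedy forward selection; score(cols) returns a cross-validated error
    (lower is better) for the model built on the given column indices."""
    remaining = set(range(n_features))
    selected, best = [], float("inf")
    while remaining:
        j, s = min(((j, score(selected + [j])) for j in remaining),
                   key=lambda t: t[1])
        if s >= best:  # no remaining feature improves the model
            break
        selected.append(j)
        remaining.discard(j)
        best = s
    return selected
```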
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary