Machine Learning with MATLAB - Classification
Stanley Liang, PhD, York University
7/27/2017




  1. Classification - the definition
• In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
• Steps for classification (a minimal MATLAB sketch follows this list):
  1. Data preparation - preprocessing, creating training / test sets
  2. Training
  3. Cross validation
  4. Model deployment
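As a minimal sketch of these four steps in MATLAB, assuming a table tbl with a categorical response column named Survived (as in the Titanic dataset); all variable names here are illustrative placeholders, not part of the original slides:

  % Reproducible random partition
  rng(1);

  % 1. Data preparation: hold out 30% of the rows as a test set
  part      = cvpartition(tbl.Survived, 'Holdout', 0.3);
  dataTrain = tbl(training(part), :);
  dataTest  = tbl(test(part), :);

  % 2. Training: fit a classifier (here a decision tree) on the training set
  mdl = fitctree(dataTrain, 'Survived');

  % 3. Validation: estimate the misclassification rate on the held-out set
  testError = loss(mdl, dataTest, 'Survived');

  % 4. Model deployment: use the trained model on new observations
  predictedLabels = predict(mdl, dataTest);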

  2. Our data sets
Titanic disaster dataset
• 891 rows
• Binary classification
• Features / predictors
  – Class: cabin class
  – Sex: gender of the passenger
  – Age
  – Fare
• Label / response
  – Survived: 0 - dead, 1 - survived

Iris dataset
• 150 rows
• Multi-class (3) classification
• Features / predictors
  – Sepal Length
  – Sepal Width
  – Petal Length
  – Petal Width
• Label / response
  – Species (string)

Pima Indians Diabetes Data (NIDDK)
• 768 rows
• Binary classification - diabetes or not
• Features / predictors (8)
  – preg: number of times pregnant
  – plas: plasma glucose concentration
  – pres: diastolic blood pressure (mmHg)
  – skin: triceps skinfold thickness (mm)
  – test: 2-hour serum insulin (mu U/ml)
  – mass: body mass index
  – pedi: diabetes pedigree function (numeric)
  – age
• Label / response: 1 - diabetes, 0 - no diabetes

Wholesale Customers dataset
• 440 rows
• Binary / multiclass (2 categorical variables)
• Continuous variables (6): the monetary units (m.u.) spent on the products
  – Fresh: fresh products
  – Milk: dairy products
  – Grocery: grocery products
  – Frozen: frozen products
  – Detergents_Paper: detergents and paper products
  – Delicatessen: delicatessen products
• Categorical variables (2)
  – Channel: 1 - Horeca, 2 - Retail
  – Region: 1 - Lisbon, 2 - Oporto, 3 - Other
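As a hedged example of getting these datasets into MATLAB tables: the Iris data ships with the Statistics and Machine Learning Toolbox; the other three would typically be read from CSV files (the file names below are assumptions, not part of the original slides):

  % Iris: ships with the Statistics and Machine Learning Toolbox
  load fisheriris                          % gives meas (150x4) and species (150x1 cell)
  iris = array2table(meas, ...
      'VariableNames', {'SepalLength','SepalWidth','PetalLength','PetalWidth'});
  iris.Species = categorical(species);     % label / response

  % Titanic, Pima and Wholesale: assumed to be local CSV files
  titanic   = readtable('titanic.csv');              % 891 rows, response: Survived
  pima      = readtable('pima_diabetes.csv');        % 768 rows
  wholesale = readtable('wholesale_customers.csv');  % 440 rows
  titanic.Survived = categorical(titanic.Survived);  % 0 - dead, 1 - survived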

  3. The workflow of classification; Optimizing a model
• Because of the prior knowledge you have about the data, or after looking at the classification results, you may want to customize the classifier.
• You can update and customize the model by setting different options in the fitting functions.
• Set the options by providing additional inputs for the option name and the option value:
  model = fitc*(tbl,'response','optionName',optionValue)
  – 'optionName' -- name of the option, e.g. 'Cost'
  – optionValue -- value to assign to the specified option, e.g. [0 10; 2 0] to change the cost matrix
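For instance, the cost-matrix option mentioned above could be set as follows (a sketch; dataTrain and Survived are assumed placeholder names, and [0 10; 2 0] is the example value from the slide):

  % Misclassifying a true class-1 observation as class 2 costs 10,
  % the reverse only 2, using the example cost matrix from the slide
  mdl = fitctree(dataTrain, 'Survived', 'Cost', [0 10; 2 0]);

  % The same name-value mechanism works for any fitc* function, e.g.
  knnMdl = fitcknn(dataTrain, 'Survived', 'NumNeighbors', 7);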

  4. k-Nearest Neighbors; Decision Trees
k-Nearest Neighbors overview
• Function: fitcknn
• Performance
  – Fit time: fast
  – Prediction time: fast, ∝ (data size)^2
  – Memory overhead: small
• Common properties
  – 'NumNeighbors' – number of neighbors used for classification
  – 'Distance' – metric used for calculating distances between neighbors
  – 'DistanceWeight' – weighting given to different neighbors
• Special notes
  – To normalize the data, use the 'Standardize' option.
  – The cosine distance metric works well for "wide" data (more predictors than observations) and data with many predictors.

Decision trees
• Function: fitctree
• Performance
  – Fit time: ∝ size of the data
  – Prediction time: fast
  – Memory overhead: small
• Common properties
  – 'SplitCriterion' – formula used to determine optimal splits at each level
  – 'MinLeafSize' – minimum number of observations in each leaf node
  – 'MaxNumSplits' – maximum number of splits allowed in the decision tree
• Special notes
  – Trees are a good choice when there is a significant amount of missing data.
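A sketch of how the listed name-value pairs might be combined (dataTrain and Species are assumed placeholders, e.g. the Iris table; the specific values are illustrative):

  % k-NN with the common properties from the slide; 'Standardize' normalizes
  % each predictor, which matters for distance-based methods
  knnMdl = fitcknn(dataTrain, 'Species', ...
      'NumNeighbors', 5, ...
      'Distance', 'cosine', ...            % works well for "wide" data
      'DistanceWeight', 'inverse', ...
      'Standardize', true);

  % Decision tree with a minimum leaf size and a cap on the number of splits
  treeMdl = fitctree(dataTrain, 'Species', ...
      'SplitCriterion', 'gdi', ...         % Gini diversity index (the default)
      'MinLeafSize', 5, ...
      'MaxNumSplits', 20);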

  5. Naïve Bayes; Discriminant Analysis
Naïve Bayes
• k-NN and decision trees do not make any assumptions about the distribution of the underlying data.
• If we assume that the data comes from a certain underlying distribution, we can treat the data as a statistical sample. This can reduce the influence of outliers on our model.
• A naïve Bayes classifier assumes the independence of the predictors within each class. This classifier is a good choice for relatively simple problems.
• Function: fitcnb
• Performance
  – Fit time: normal distribution - fast; kernel distribution - slow
  – Prediction time: normal distribution - fast; kernel distribution - slow
  – Memory overhead: normal distribution - small; kernel distribution - moderate to large
• Common properties
  – 'DistributionNames' – distribution used to calculate probabilities
  – 'Width' – width of the smoothing window (when the distribution is set to 'kernel')
  – 'Kernel' – type of kernel to use (when the distribution is set to 'kernel')
• Special notes
  – Naïve Bayes is a good choice when there is a significant amount of missing data.

Discriminant Analysis
• Similar to naïve Bayes, discriminant analysis works by assuming that the observations in each prediction class can be modeled with a normal probability distribution. There is no assumption of independence of the predictors: a multivariate normal distribution is fitted to each class.
• Function: fitcdiscr
• Linear discriminant analysis
  – The default classification assumes that the covariance is the same for each response class. This results in linear boundaries between classes.
  – daModel = fitcdiscr(dataTrain,'response');
• Quadratic discriminant analysis
  – Giving up the equal-covariance assumption, a quadratic boundary is drawn between classes.
  – daModel = fitcdiscr(dataTrain,'response','DiscrimType','quadratic');
• Performance
  – Fit time: fast, ∝ size of the data
  – Prediction time: fast, ∝ size of the data
  – Memory overhead: linear DA - small; quadratic DA - moderate to large, ∝ number of predictors
• Common properties
  – 'DiscrimType' – type of boundary used
  – 'Delta' – coefficient threshold for including predictors in a linear boundary (default 0)
  – 'Gamma' – regularization to use when estimating the covariance matrix for linear DA
• Special notes
  – Linear discriminant analysis works well for "wide" data (more predictors than observations).
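A sketch of the distribution-based classifiers above (dataTrain and Species are assumed placeholder names; the kernel width 0.5 is an arbitrary illustrative value):

  % Naive Bayes: normal distributions per predictor by default; switch to a
  % kernel density when a predictor is clearly non-normal (slower, more memory)
  nbMdl = fitcnb(dataTrain, 'Species', ...
      'DistributionNames', 'kernel', ...
      'Kernel', 'normal', ...
      'Width', 0.5);                       % smoothing-window width (assumed value)

  % Linear discriminant analysis (default: shared covariance, linear boundaries)
  daModel = fitcdiscr(dataTrain, 'Species');

  % Quadratic discriminant analysis: per-class covariance, quadratic boundaries
  qdaModel = fitcdiscr(dataTrain, 'Species', 'DiscrimType', 'quadratic');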

  6. Support Vector Machines; Cross Validation
Support Vector Machines
• An SVM calculates the boundary that correctly separates the different groups of data with the largest margin.
• Performance
  – Fit time: fast, ∝ square of the size of the data
  – Prediction time: very fast, ∝ square of the size of the data
  – Memory overhead: moderate
• Common properties
  – 'KernelFunction' – variable transformation to apply
  – 'KernelScale' – scaling applied before the kernel transformation
  – 'BoxConstraint' – regularization parameter controlling the misclassification penalty
• Special notes
  – SVMs use a distance-based algorithm. If the data is not normalized, use the 'Standardize' option.
  – Linear SVMs work well for "wide" data (more predictors than observations); Gaussian SVMs often work better on "tall" data (more observations than predictors).

Multiclass Support Vector Machines
• The underlying calculations for classification with support vector machines are binary by nature. You can perform multiclass SVM classification by creating an error-correcting output codes (ECOC) classifier:
  – First, create a template for a binary classifier.
  – Second, create the multiclass SVM classifier with the function fitcecoc.
• A sketch of both steps is shown after this slide.

Cross Validation
• To compare model performance, we can calculate the loss for each method and pick the method with the minimum loss.
• The loss is calculated on a specific test set. It is possible that a learning algorithm performs well on that particular test data but does not generalize well to other data.
• The general idea of cross validation is to repeat the above process with different training and test sets: fit the model to each training set and calculate the loss on the corresponding test set.
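A sketch of a binary SVM and of the two-step ECOC recipe (dataTrain, Survived, irisTrain and Species are assumed placeholder names):

  % Binary SVM on standardized data with a Gaussian (RBF) kernel
  svmMdl = fitcsvm(dataTrain, 'Survived', ...
      'KernelFunction', 'gaussian', ...
      'KernelScale', 'auto', ...
      'BoxConstraint', 1, ...
      'Standardize', true);

  % Multiclass case (e.g. the 3-class Iris data): wrap a binary SVM template
  % in an error-correcting output codes (ECOC) classifier
  t       = templateSVM('KernelFunction', 'linear', 'Standardize', true);
  ecocMdl = fitcecoc(irisTrain, 'Species', 'Learners', t);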

  7. Keyword-value pairs for cross validation; Strategies to reduce predictors
Keyword-value pairs for cross validation
• mdl = fitcknn(data,'responseVarName','optionName',optionValue)
  – 'CrossVal','on' -- 10-fold cross validation
  – 'Holdout',p (scalar from 0 to 1) -- holdout with the given fraction reserved for validation
  – 'KFold',k (scalar) -- k-fold cross validation
  – 'Leaveout','on' -- leave-one-out cross validation
• If you already have a partition created with the cvpartition function, you can also provide it to the fitting function:
  >> part = cvpartition(y,'KFold',k);
  >> mdl = fitcknn(data,'responseVarName','CVPartition',part);
• To evaluate a cross-validated model, use the kfoldLoss function to compute the loss:
  >> kfoldLoss(mdl)

Strategies to reduce predictors
• High-dimensional data: machine learning problems often involve high-dimensional data with hundreds or thousands of predictors, e.g. facial recognition or weather prediction.
• Learning algorithms are often computation intensive, and reducing the number of predictors brings significant benefits in computation time and memory consumption.
• Reducing the number of predictors also results in simpler models that generalize better and are easier to interpret.
• Two common approaches (see the sketch after this slide for a feature-transformation example):
  – Feature transformation -- transform the coordinate space of the observed variables.
  – Feature selection -- choose a subset of the observed variables.
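Putting the two topics together, a hedged sketch of comparing cross-validated models and of PCA-based feature transformation (data, Survived, X and y are placeholder names; the 95% variance threshold is an illustrative choice):

  % Compare two cross-validated models by their 10-fold loss
  knnCV  = fitcknn(data,  'Survived', 'KFold', 10);
  treeCV = fitctree(data, 'Survived', 'KFold', 10);
  [kfoldLoss(knnCV), kfoldLoss(treeCV)]    % keep the model with the smaller loss

  % Feature transformation with PCA: keep enough components to explain ~95%
  % of the variance, then train on the reduced coordinate space
  [coeff, score, ~, ~, explained] = pca(zscore(X));   % X: numeric predictor matrix
  nComp    = find(cumsum(explained) >= 95, 1);
  Xreduced = score(:, 1:nComp);
  pcaMdl   = fitcknn(Xreduced, y, 'NumNeighbors', 5);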
