Statistical Natural Language Processing
Machine learning: evaluation

Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft
Summer Semester 2017
Measuring success/failure in regression

Root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

(Figure: data points y_i and model predictions ŷ_i plotted against x, illustrating the residuals.)

• Measures average error in units compatible with the outcome variable
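As a quick illustration, the following is a minimal NumPy sketch of RMSE; the data values are made up for the example and are not from the slides.

import numpy as np

def rmse(y_true, y_pred):
    # average error in the same units as the outcome variable
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# toy values, for illustration only
print(rmse([3.0, 5.0, 2.5, 7.0], [2.8, 5.4, 2.0, 6.5]))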
Coefficient of determination

$$R^2 = \frac{\sum_{i}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i}^{n} (y_i - \bar{y})^2} = 1 - \left(\frac{\mathrm{RMSE}}{\sigma_y}\right)^2$$

(Figure: fitted regression line with the mean ȳ marked, illustrating explained vs. total variation.)

• R² is a standardized measure in the range [0, 1]
• Indicates the ratio of the variance of y explained by x
• For a single predictor it is the square of the correlation coefficient r
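A corresponding sketch for R², written with the equivalent 1 − SS_res/SS_tot form of the definition above; the values are again made up for illustration.

import numpy as np

def r_squared(y_true, y_pred):
    # ratio of the variance of y explained by the model
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared([3.0, 5.0, 2.5, 7.0], [2.8, 5.4, 2.0, 6.5]))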
Measuring success in classification

• In classification, we do not care (much) about the average of the error function
• We are interested in how many of our predictions are correct
• Accuracy measures this directly:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
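A one-line sketch of this computation; the labels are made up for the example.

def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# made-up labels, for illustration only
print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))  # 0.75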
Accuracy may go wrong

• Think about a 'dummy' search engine that always returns an empty document set (no results found)
• If we have
  – 1 000 000 documents
  – 1000 relevant documents (including the term in the query)
  the accuracy is:

$$\text{accuracy} = \frac{999\,000}{1\,000\,000} = 99.90\,\%$$

• In general, if our class distribution is skewed, accuracy will be a bad indicator of success
Measuring success in classification: precision, recall, F-score

                     true value
                  positive   negative
predicted  pos.      TP         FP
           neg.      FN         TN

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} \qquad F_1\text{-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
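A small sketch computing the three measures directly from confusion-matrix counts; the example counts are the ones used later in the lecture.

def precision_recall_f1(tp, fp, fn):
    # precision, recall and F1 from confusion-matrix counts
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

print(precision_recall_f1(tp=7, fp=9, fn=3))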
Example: back to the search engine

• We had a 'dummy' search engine that returned false for all queries
• For a query
  – 1 000 000 documents
  – 1000 relevant documents

$$\text{accuracy} = \frac{999\,000}{1\,000\,000} = 99.90\,\%$$
$$\text{precision} = \frac{0}{0}\ \text{(undefined; conventionally taken as } 0\,\%)$$
$$\text{recall} = \frac{0}{1\,000} = 0\,\%$$

• Precision and recall are asymmetric; the choice of the 'positive' class is important
Classifier evaluation: another example

Consider the following two classifiers:

Classifier 1:                          Classifier 2:
                  true value                             true value
               positive  negative                     positive  negative
pred.  pos.       7          9         pred.  pos.       1          3
       neg.       3          1                neg.       9          7

             Classifier 1    Classifier 2
Accuracy     8/20 = 0.40     8/20 = 0.40
Precision    7/16 = 0.44     1/4  = 0.25
Recall       7/10 = 0.70     1/10 = 0.10
F-score      0.54            0.14
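To double-check these numbers, one can plug the two confusion matrices into the formulas from the previous slide; a minimal sketch:

def metrics(tp, fp, fn, tn):
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

for name, counts in [("classifier 1", dict(tp=7, fp=9, fn=3, tn=1)),
                     ("classifier 2", dict(tp=1, fp=3, fn=9, tn=7))]:
    acc, prec, rec, f1 = metrics(**counts)
    print(f"{name}: acc={acc:.2f} prec={prec:.2f} rec={rec:.2f} f1={f1:.2f}")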
Multi-class evaluation

• For multi-class problems, it is common to report average precision/recall/F-score
• For C classes, averaging can be done in two ways:

$$\text{precision}_M = \frac{\sum_{i}^{C} \frac{TP_i}{TP_i + FP_i}}{C} \qquad \text{recall}_M = \frac{\sum_{i}^{C} \frac{TP_i}{TP_i + FN_i}}{C}$$

$$\text{precision}_\mu = \frac{\sum_{i}^{C} TP_i}{\sum_{i}^{C} (TP_i + FP_i)} \qquad \text{recall}_\mu = \frac{\sum_{i}^{C} TP_i}{\sum_{i}^{C} (TP_i + FN_i)}$$

(M = macro, µ = micro)

• The averaging can also be useful for binary classification, if there is no natural positive class
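A sketch of the two averaging schemes, given per-class TP/FP/FN counts; the counts below are made up for illustration.

def macro_micro(tp, fp, fn):
    # tp, fp, fn: one count per class
    C = len(tp)
    prec_macro = sum(t / (t + f) for t, f in zip(tp, fp)) / C
    rec_macro  = sum(t / (t + f) for t, f in zip(tp, fn)) / C
    prec_micro = sum(tp) / (sum(tp) + sum(fp))
    rec_micro  = sum(tp) / (sum(tp) + sum(fn))
    return prec_macro, rec_macro, prec_micro, rec_micro

# three classes, made-up counts
print(macro_micro(tp=[10, 8, 7], fp=[7, 14, 7], fn=[2, 10, 16]))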
Confusion matrix

(The slide shows an example 3 × 3 confusion matrix of predicted vs. true class for three classes a, b, c.)

• A confusion matrix is often useful for multi-class classification tasks
• Are the classes balanced?
• What is the accuracy?
• What is the per-class, and averaged, precision/recall?
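A minimal sketch of building such a matrix from predicted and gold labels; the label sequences are made up and do not correspond to the matrix on the slide.

from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    # rows: true class, columns: predicted class
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# made-up label sequences, for illustration only
labels = ["a", "b", "c"]
y_true = ["a", "a", "b", "b", "c", "c", "c", "a"]
y_pred = ["a", "b", "b", "b", "c", "a", "c", "a"]
for label, row in zip(labels, confusion_matrix(y_true, y_pred, labels)):
    print(label, row)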
Precision–recall trade-off

(Figure: precision–recall curve, with both axes ranging from 0 to 1.)

• Increasing precision (e.g., by changing a hyperparameter) results in decreasing recall
• Precision–recall graphs are useful for picking the correct model
• Area under the curve (AUC) is another indication of success of a classifier
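One way to obtain such a curve is to sweep a decision threshold over the classifier's scores and record precision and recall at each point; the area under the curve can then be approximated numerically. A sketch with made-up scores and gold labels:

import numpy as np

# made-up classifier scores and gold labels (1 = positive)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
gold   = np.array([1,   1,   0,   1,   0,    1,   0,   0])

points = []
for threshold in sorted(scores, reverse=True):
    pred = scores >= threshold              # predict positive above the threshold
    tp = np.sum(pred & (gold == 1))
    points.append((tp / np.sum(gold == 1),  # recall
                   tp / np.sum(pred)))      # precision

recalls, precisions = zip(*points)
print(np.trapz(precisions, recalls))  # trapezoidal approximation of the AUC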
Performance metrics: a summary

• Accuracy does not reflect classifier performance when the class distribution is skewed
• Precision and recall are binary and asymmetric measures
• For multi-class problems, calculating accuracy is straightforward, but other measures need averaging
• These are just the most common measures: there are more
• You should understand what these metrics measure, and use/report the metric that is useful for the purpose
Model selection/evaluation

• Our aim is to fit models that are (also) useful outside the training data
• Evaluating a model on the training data is wrong: complex models tend to fit to the noise in the training data
• The results should always be tested on a test set that does not overlap with the training data
• The test set is ideally used only once, to evaluate the final model
• Often, we also need to tune the model, e.g., to tune hyperparameters (e.g., the regularization constant)
• Tuning has to be done on a separate development set
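A common way to set this up in practice is a random three-way split of the data; a minimal sketch (the split ratios are just an example, not a recommendation from the slides):

import numpy as np

def train_dev_test_split(n_items, dev_frac=0.1, test_frac=0.1, seed=42):
    # disjoint index arrays for training, development and test sets
    idx = np.random.default_rng(seed).permutation(n_items)
    n_test, n_dev = int(n_items * test_frac), int(n_items * dev_frac)
    return idx[n_test + n_dev:], idx[n_test:n_test + n_dev], idx[:n_test]

train, dev, test = train_dev_test_split(1000)
print(len(train), len(dev), len(test))  # 800 100 100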
Back to polynomial regression

(Figure: scatter plot of y against x, with fitted polynomial curves of increasing degree overlaid.)

$$y = -221.3 + 109.9x$$
$$y = 45.50 - 3.52x + 12.13x^2$$
$$y = 1445.80 - 3189.13x + 2604.21x^2 - 1026.76x^3 + 218.40x^4 - 25.52x^5 + 1.54x^6$$
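Fits like these can be reproduced with ordinary least squares on polynomial features; the data from the slide is not available, so the sketch below uses a made-up quadratic generating function.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(2, 10, size=30)
y = 12 * x**2 - 3 * x + 40 + rng.normal(scale=40, size=x.shape)  # noisy quadratic, made up

for degree in (1, 2, 6):
    coeffs = np.polyfit(x, y, deg=degree)     # least-squares polynomial fit
    print(degree, np.round(coeffs[::-1], 2))  # coefficients, intercept first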
Training/test error

(Figure: training and test error plotted against polynomial degree, from 1 to 10.)
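The shape of this plot can be reproduced by fitting polynomials of increasing degree on a training set and measuring RMSE on both the training set and a held-out test set; again a sketch with synthetic data, not the data from the slide.

import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(2, 10, size=n)
    y = 12 * x**2 - 3 * x + 40 + rng.normal(scale=40, size=n)  # made-up quadratic truth
    return x, y

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

for degree in range(1, 11):
    poly = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    print(degree,
          round(rmse(y_train, poly(x_train)), 1),  # training error keeps shrinking
          round(rmse(y_test, poly(x_test)), 1))    # test error typically grows once the model overfits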
Bias and variance (revisited)

Bias of an estimate is the difference between the expected value of the estimate and the value being estimated:

$$B(\hat{w}) = E[\hat{w}] - w$$

• An unbiased estimator has 0 bias

Variance of an estimate is, simply, its variance: the expected squared deviation from the mean estimate:

$$\mathrm{var}(\hat{w}) = E\left[(\hat{w} - E[\hat{w}])^2\right]$$

• Bias–variance relationship is a trade-off: models with low bias result in high variance
• Here w stands for the parameters that define the model
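Both quantities can be estimated empirically by repeatedly drawing training samples and refitting; a small simulation sketch for a single parameter (the estimator and the data-generating setup are made up for illustration):

import numpy as np

rng = np.random.default_rng(2)
true_w = 3.0              # the value being estimated
estimates = []
for _ in range(1000):     # repeat the "experiment" many times
    sample = rng.normal(loc=true_w, scale=2.0, size=10)
    estimates.append(sample.mean())   # the estimator: the sample mean

estimates = np.array(estimates)
bias = estimates.mean() - true_w                          # B(w_hat) = E[w_hat] - w
variance = np.mean((estimates - estimates.mean()) ** 2)   # var(w_hat)
print(round(bias, 3), round(variance, 3))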
Some issues with bias and variance

• Overfitting occurs when the model learns the idiosyncrasies of the training data
• Underfitting occurs when the model is not flexible enough for the data at hand
• Complex models tend to overfit, and exhibit high variance
• Simple models tend to show low variance, but are likely to have (high) bias
Cross validation

• To avoid overfitting, we want to tune our models on a development set
• But (labeled) data is valuable
• Cross validation is a technique that uses all the data, for both training and tuning, with some additional effort
• Besides tuning hyper-parameters, we may also want to get 'average' parameter estimates over multiple folds
• We may also use cross-validation during testing
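A minimal k-fold cross-validation sketch, written against a generic fit/evaluate pair so that the same data serves both training and tuning; the model and scoring functions below are placeholders, not anything prescribed by the slides.

import numpy as np

def k_fold_cv(X, y, fit, evaluate, k=5, seed=0):
    # average validation score over k train/validation splits
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(evaluate(model, X[val], y[val]))
    return float(np.mean(scores))

# placeholder model: predict the training mean; score = negative squared error
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + np.random.default_rng(3).normal(size=100)
fit = lambda X, y: y.mean()
evaluate = lambda model, X, y: -np.mean((y - model) ** 2)
print(k_fold_cv(X, y, fit, evaluate, k=5))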