Lecture 8
Agenda:
• String matching
• How to evaluate a pattern recognition system
String Matching (note 1)
Definitions:
• Pattern: x = "movi". Text: "zlatanibrahimovic"
• Shift: s = offset from the start of the text to the start position of x
• Valid shift: s = offset to a complete match
• Applications: find a word in a text, count words, etc.
String Matching - Algorithm
• Naive string matching: brute force
• OK, but slow for large texts
• Alternative: Boyer-Moore string matching
• Faster because the shift can advance by more than one position: s = s + k, where k > 1 is possible
• k = 1 always, for the naive algorithm
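A minimal sketch of the brute-force approach in Python (the function name is illustrative); it tries every shift and advances by k = 1:

```python
def naive_match(text, pattern):
    """Return every valid shift s where pattern occurs in text.

    Tries all shifts s = 0, 1, ..., len(text) - len(pattern),
    so the shift always advances by k = 1.
    """
    n, m = len(text), len(pattern)
    return [s for s in range(n - m + 1)
            if text[s:s + m] == pattern]

# Example from the slide: x = "movi" in "zlatanibrahimovic"
print(naive_match("zlatanibrahimovic", "movi"))  # [12]
```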
Boyer-Moore: Definitions
• Algorithm (on the blackboard)
• Good suffix: the elements (from the right) which match
• Bad character: the first (from the right) mismatching element
• Calculate the shift implied by each and apply the maximum.
Boyer-Moore: Definitions
• F(x): last-occurrence function (bad character)
• Look-up table containing each letter of the alphabet together with its right-most location in x
• Example: x = "bror". F(x): o = 3, r = 2, b = 1, the rest = 0
• Example: x = "estimates". F(x): e = 8, t = 7, a = 6, m = 5, i = 4, s = 2, the rest = 0
• NB: the right-most element is ignored, since it corresponds to the current shift
Boyer-Moore: Definitions
• G(x): good-suffix function
• Look-up table containing the second right-most position of each suffix that can be (re)found in x
• Example: x = "bror". G(x): r = 2, the rest = 0 (hence or = 0, ror = 0)
• Example: x = "estimates". G(x): s = 2, es = 1, the rest = 0
Boyer-Moore String Matching
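Since the slides work the example on the board, here is a hedged sketch of the scan in Python. It implements only the bad-character rule, using the standard rightmost-occurrence table (the lecture's variant ignores the right-most element); the full algorithm also computes the good-suffix shift and advances by the maximum of the two:

```python
def last_occurrence(pattern):
    """F(x): rightmost index of each character in the pattern."""
    return {c: i for i, c in enumerate(pattern)}

def boyer_moore_bad_char(text, pattern):
    """Boyer-Moore scan using only the bad-character rule.

    Compares right-to-left; on a mismatch at pattern[j], shifts so
    the bad character in the text lines up with its rightmost
    occurrence in the pattern (always advancing by at least 1).
    """
    F = last_occurrence(pattern)
    n, m = len(text), len(pattern)
    matches, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                      # the matched part is the good suffix
        if j < 0:
            matches.append(s)           # valid shift: complete match
            s += 1
        else:
            s += max(1, j - F.get(text[s + j], -1))
    return matches

print(boyer_moore_bad_char("zlatanibrahimovic", "movi"))  # [12]
```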
Distance Measures for Strings
• We know what to do for features...
• x = "hej", y = "her", z = "haj"
• Dist(x,y)? Is Dist(x,y) > Dist(x,z)?
• Applications: spell-checking, speech recognition, DNA analysis, copy-cat (plagiarism) detection, ...
• Hamming distance: requires equal-length strings, |x| = |y|
• Measures the number of positions where the strings differ
• Dist(x,y) = 1, Dist(y,z) = 2, and Dist(x,y) = Dist(x,z)
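A minimal sketch of the Hamming distance, reproducing the numbers above:

```python
def hamming(x, y):
    """Number of positions where x and y differ; requires |x| == |y|."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))

print(hamming("hej", "her"), hamming("her", "haj"), hamming("hej", "haj"))
# 1 2 1  -> Dist(x,y) == Dist(x,z)
```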
Distance Measures for Strings
• Levenshtein distance
• |x| = |y| is not required => a better measure
• Aka edit distance, since the distance is defined as the number of operations that must be performed on x in order to obtain y
Edit Distance (change x into y)
• Cost matrix C (fill the 1st row and 1st column, then one column at a time):

    C[i,j] = min( C[i-1,j] + 1,                       (deletion)
                  C[i,j-1] + 1,                       (insertion)
                  C[i-1,j-1] + 1 - δ(x[i],y[j]) )     (no change / exchange)

  where δ(x[i],y[j]) = 1 if x[i] = y[j], otherwise 0, so a match costs 0 and an exchange costs 1
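A minimal sketch of this dynamic-programming recurrence in Python (the function name is illustrative):

```python
def levenshtein(x, y):
    """Edit distance via the cost matrix C from the slide.

    C[i][j] = minimal cost of turning x[:i] into y[:j];
    the diagonal move costs 1 - delta: 0 for a match, 1 for an exchange.
    """
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                    # i deletions
    for j in range(n + 1):
        C[0][j] = j                    # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 1 if x[i - 1] == y[j - 1] else 0
            C[i][j] = min(C[i - 1][j] + 1,               # deletion
                          C[i][j - 1] + 1,               # insertion
                          C[i - 1][j - 1] + 1 - delta)   # no change / exchange
    return C[m][n]

print(levenshtein("hej", "her"))  # 1
```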
Recognition Rate
• Some system specifications require technical success criteria for your project (product)
• HW, SW, real-time, recognition rate, ...
• Recognition rate = (number of correctly classified samples) / (number of tested samples)
• Multiply by 100% to get a percentage
• How do you test a system?
• How do you present and interpret the results?
Methods for Testing
• Cross-validation (simple holdout):
  • Train on α% of the samples (α > 50) and test on the rest
  • α is typically 90, depending on the number of samples and the complexity of the system
• M-fold cross-validation:
  • Divide (randomly) all samples into M equally sized groups
  • Use M-1 groups to train the system and test on the remaining group
  • Do this M times and average the results
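A sketch of M-fold cross-validation in Python; train_fn and test_fn are hypothetical stand-ins for whatever training and classification routines the system provides:

```python
import random

def m_fold_cross_validation(samples, labels, train_fn, test_fn, M=10):
    """M-fold cross-validation as described above.

    Shuffles the samples, splits them into M roughly equal folds,
    trains on M-1 folds, tests on the held-out fold, and averages
    the M recognition rates.
    """
    data = list(zip(samples, labels))
    random.shuffle(data)                     # randomize fold assignment
    folds = [data[i::M] for i in range(M)]   # M roughly equal groups
    rates = []
    for i in range(M):
        test_set = folds[i]
        train_set = [d for j, f in enumerate(folds) if j != i for d in f]
        model = train_fn(train_set)          # hypothetical training call
        correct = sum(test_fn(model, x) == y for x, y in test_set)
        rates.append(correct / len(test_set))
    return sum(rates) / M                    # average recognition rate
```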
Interpretation of the Results
• Recognition rate = (number of correctly classified samples) / (number of tested samples); multiply by 100% to get a percentage
• Error % = 100% - (recognition rate × 100%)
• Distribution of errors?
• Confusion matrix (3 classes, 25 samples per class):

                          Output (from the system)
                            P1    P2    P3
    Input          P1       19     5     1
    (the truth)    P2        0    24     1
                   P3        1     4    20
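For this matrix the recognition rate is (19 + 24 + 20) / 75 = 63/75 = 84%, i.e. an error rate of 16%. The off-diagonal cells show where the errors go: most mistakes are P1 and P3 samples being classified as P2.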
General Representation of Errors
• Number of errors = incorrectly recognized + not recognized
• The total number of errors can be represented like this:

                          Output (from the system)
                          Yes                                 No
    Input    Yes          (correct)                           Not recognized
    (the                                                      Type II error, Miss
    truth)                                                    False negative (FN)
                                                              False reject (FR), False reject rate (FRR)
             No           Incorrectly recognized              (correct)
                          Type I error, False alarm
                          False positive (FP), Ghost object
                          False accept (FA), False accept rate (FAR)
General Representation of Errors
• Example: SETI
• Find intelligent signals in the input data
• FN versus FP - are they equally important?

                          Output: Yes                         Output: No
    Input: Yes            (correct)                           Not recognized (FN, Miss): No !!
    Input: No             Incorrectly recognized (FP): Ok     (correct)

• A missed signal (FN) is unacceptable, while a false alarm (FP) merely costs follow-up time.
General Representation of Errors
• Example: access control to nuclear weapons
• Is the person trying to enter OK?
• FN versus FP - are they equally important?

                          Output: Yes                             Output: No
    Input: Yes            (correct)                               Not recognized (FN, False reject): Ok
    Input: No             Incorrectly recognized (FP, FA): No !!  (correct)

• Falsely rejecting an authorized person is a nuisance; falsely accepting an unauthorized one is catastrophic.
Receiver Operating Characteristic Methodology
Introduction to ROC Curves
• ROC = Receiver Operating Characteristic
• Originated in electronic signal detection theory (1940s-1950s)
• Has become very popular in biomedical applications, particularly radiology and imaging
• Also used in machine learning to assess classifiers
• Can be used to compare tests/procedures
ROC curve
ROC Curves: Simplest Case
• Consider a diagnostic test for a disease
• The test has 2 possible outcomes:
  - 'positive' = suggesting presence of the disease
  - 'negative'
• An individual can test either positive or negative for the disease
True Disease State vs. Test Result

                          Test negative (not rejected)        Test positive (rejected)
    No disease (D = 0)    correct: specificity                Type I error (false +), α
    Disease (D = 1)       Type II error (false -), β          correct: power = 1 - β; sensitivity

• Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
Specific Example
[Figure: two distributions of test results along the test-result axis - patients without the disease and patients with the disease - partially overlapping]
Threshold
[Figure: a decision threshold on the test-result axis - patients scoring below it are called "negative", those above it are called "positive"]
Some definitions...
[Figure series: the same two distributions with the decision threshold marked]
• True positives (TP): patients with the disease who are called "positive"
• False positives (FP): patients without the disease who are called "positive"
• True negatives (TN): patients without the disease who are called "negative"
• False negatives (FN): patients with the disease who are called "negative"
Moving the Threshold
[Figure: the same distributions with the threshold moved right, then left]
• Moving the threshold right: fewer false positives but more false negatives (higher specificity, lower sensitivity)
• Moving the threshold left: more false positives but fewer false negatives (lower specificity, higher sensitivity)
ROC Curve
[Figure: ROC curve - True Positive Rate (sensitivity), 0-100%, on the y-axis versus False Positive Rate (1 - specificity), 0-100%, on the x-axis; one point per threshold setting]
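As a sketch of how such a curve is traced, the following Python sweeps the decision threshold over two sets of scores (variable names are illustrative) and collects one (FPR, TPR) point per threshold:

```python
def roc_points(neg_scores, pos_scores):
    """Return (FPR, TPR) pairs, one per threshold setting.

    'score >= threshold' is called positive; sweeping the threshold
    from low to high moves the operating point from (1,1) toward (0,0).
    """
    points = []
    for t in sorted(set(neg_scores) | set(pos_scores)):
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    points.append((0.0, 0.0))   # threshold above every score
    return points
```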
ROC Curve Comparison
[Figure: two ROC plots - a good test bows toward the top-left corner; a poor test lies close to the diagonal]
ROC Curve Extremes
[Figure: best test - the curve passes through the top-left corner; the distributions don't overlap at all. Worst test - the curve follows the diagonal; the distributions overlap completely]
Area Under the ROC Curve (AUC)
• Overall measure of test performance
• Comparisons between two tests are based on differences between their (estimated) AUCs
• For continuous data, AUC is equivalent to the Mann-Whitney U-statistic (a nonparametric test of the difference in location between two populations)
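Following the Mann-Whitney equivalence mentioned above, AUC can be estimated directly as the fraction of (positive, negative) score pairs that are ranked correctly; a minimal sketch:

```python
def auc(neg_scores, pos_scores):
    """AUC as the normalized Mann-Whitney U statistic: the probability
    that a random positive scores higher than a random negative,
    counting ties as 1/2.
    """
    pairs = len(pos_scores) * len(neg_scores)
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / pairs
```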
AUC for ROC Curves
[Figure: four ROC plots with AUC = 100% (perfect), AUC = 90%, AUC = 65%, and AUC = 50% (chance)]
K-fold Cross-Validation
• Randomly sort the data
• Divide it into k folds (e.g. k = 10)
• Use one fold for validation and the remaining folds for training
• Average the accuracy over the k runs