Machine Learning sanparith.marukatat@nectec.or.th
Today • Example of intelligent system: OCR • k-Nearest Neighbor Classifier • Generative model • Maximum likelihood • Naïve Bayes model • Gaussian model
What is OCR? • Optical Character Recognition – Input: scanned images, photos, video frames – Output: text file • Alternative input – Electronic stylus, e.g. on a PDA – Online handwriting recognition
OCR process (1) • Preprocessing – Image enhancement, denoise, deskew, ... – Binarization • Layout analysis and character segmentation • Character recognition • Spell correction
[Figure: layout analysis example – a two-column page image segmented into a text line and a single character; noise regions should be removed]
OCR process (2) • Preprocessing uses image processing techniques. • Layout analysis uses rule-based methods + some statistics. • Character recognition – Classifier (trained from a training corpus) – Look-up table: class number → ASCII or Unicode code (given by the system designer) • Spell correction uses a dictionary + some statistics + some NLP techniques.
How to build the recognition module? (1) • Technical choices – All characters separated: Neural Network, SVM – Few touching characters: add classes for the touching characters – Some broken characters (e.g. อำ ): add classes for sub-characters – Rule-based segmentation – Many touching characters (e.g. Arabic, Urdu): 2D-HMM
How to build the recognition module? (2) • Normalize the character image (reduce variation, get a fixed size) • Select features – High-level features, e.g. head, tail • Contain more information • Cannot be reliably detected – Low-level features: pixel colors, edges • A single feature is not meaningful • Can be easily detected • Can be improved: PCA, LDA, NLDA, ...
How to build the recognition module? (2) • Design the classes • Build a feature extractor, e.g. a vector of pixel colors • Construct a training corpus – 1 example = 1 vector and 1 class – Very large number of examples – Cover all conditions: dpi, fonts, font sizes, styles (e.g. slant, bold), writing styles, pen styles, with and without noise
How to build the recognition module? (3) • Building the corpus – Handwritten • Collect samples • Segment from forms or segment manually – Printed • Print different fonts, font sizes, ... • Scan, scan of a copy, ... • Time consuming
How to build the recognition module? (4) • Select a classifier – Select tools • SNNS or FANN for Neural Networks • libsvm or SVMlight for SVM • Weka – Format of the training corpus – Parameters and their values – How to use it in your code
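For instance, libsvm and SVMlight both read training examples in a simple sparse text format, one example per line (the numbers below are made up for illustration):

    # <class label> <feature index>:<value> <feature index>:<value> ...
    # Indices start at 1 and must be increasing; zero-valued features may be omitted.
    1 1:0.00 2:0.85 7:0.12
    5 3:0.40 4:0.33 6:0.91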
What is a Neural Network? • Biologically inspired multi-class classifier • Set of nodes with directed connections, e.g. Multi-Layer Perceptron, Diabolo network, Recurrent network
Using a neural network • Try an MLP with 1 hidden layer first • 1 architecture parameter = number of hidden nodes • Training with gradient descent • 1 training parameter = learning rate
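A minimal sketch of this recipe (not from the slides): a 1-hidden-layer MLP trained by stochastic gradient descent with scikit-learn, where the hidden-layer size and the learning rate are the two parameters mentioned above. The digits dataset is used here only as a stand-in for a normalized character-image corpus.

    # Sketch: 1-hidden-layer MLP trained by gradient descent (scikit-learn).
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)                # 8x8 images flattened to 64 pixel values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlp = MLPClassifier(hidden_layer_sizes=(50,),      # architecture parameter: hidden nodes
                        solver='sgd',                  # gradient-descent training
                        learning_rate_init=0.01,       # training parameter: learning rate
                        max_iter=500, random_state=0)
    mlp.fit(X_train, y_train)
    print("test accuracy:", mlp.score(X_test, y_test))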
What is SVM? • Linear classifier using the kernel trick, trained to trade off between error and generalization – Linear classifier • output is a linear combination of the input features • y = sign(w^T x) • Use multiple linear classifiers for multi-class problems – Kernel trick • Replace all dot products with a kernel function • K(x1, x2) = <g(x1), g(x2)> for some (possibly unknown) mapping g
Using SVM • Kernel functions – Radial Basis Function (RBF): K(x, y) = exp(-γ ||x − y||^2) – Polynomial: K(x, y) = (<x, y> + 1)^d • Trade-off parameter C – Small C = generalization is more important than error – Large C = error is more important
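A small sketch (not from the slides) of the two kernels above in NumPy; the function names are illustrative.

    # Sketch: RBF and polynomial kernels between two feature vectors.
    import numpy as np

    def rbf_kernel(x, y, gamma=0.005):
        return np.exp(-gamma * np.sum((x - y) ** 2))   # exp(-gamma * ||x - y||^2)

    def poly_kernel(x, y, d=3):
        return (np.dot(x, y) + 1) ** d                 # (<x, y> + 1)^d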
Exercise • MNIST dataset with libsvm, using the RBF kernel • SVM parameters – gamma = inverse of the area of influence around each example; try 0.005 – C = trade-off parameter between error on the training set and generalization; try 1000
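One possible way to run this exercise in Python (a sketch, not the slides' solution): scikit-learn's SVC wraps libsvm, so the same RBF kernel and parameters apply. Loading MNIST via fetch_openml, scaling the pixels, and subsampling the training set for speed are assumptions added here.

    # Sketch: RBF-kernel SVM on MNIST with the parameters suggested in the exercise.
    from sklearn.datasets import fetch_openml
    from sklearn.svm import SVC

    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
    X = X / 255.0                                    # scale pixel values to [0, 1]
    X_train, y_train = X[:10000], y[:10000]          # subsample: training on all 60k is slow
    X_test, y_test = X[60000:], y[60000:]            # standard 10k test images

    clf = SVC(kernel='rbf', gamma=0.005, C=1000)     # gamma and C from the exercise
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))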
How to build the recognition module? (5) • Multi-pass classifier – Rough classification: upper vowels, mid-level characters, lower vowels – Fine classification: กถภ , ปฝฟ , ... – Finer classification
1-Nearest Neighbor classifier • Prototype-based (template-based) classifier • Needs a distance function • Useful when – We have a very limited number of training examples and cannot train another classifier – We have a large number of training examples and just want a baseline • What is the performance of this model? – When n → ∞, the 1-NN error is less than 2 × the Bayes error
Bayes error ● If we know P(Class_1|x), ..., P(Class_M|x), then the Bayes classification rule f(x) = arg max_{i=1,...,M} P(Class_i|x) produces the minimum possible expected error, called the Bayes error. [Figure: posterior curves P(Y=1|x) and P(Y=2|x) over x, with the expected-error region highlighted]
k-NN and Bayes classification rule ● P(x) ≈ k / (N·V) ● N = number of examples in the training set ● V = volume of a small region around x ● k = number of training points in V ● P(x|y) ≈ k_y / (N_y·V) ● k_y = number of examples of class y in V ● N_y = number of examples of class y in the training set ● P(y) ≈ N_y / N ● P(y|x) ≈ k_y / k (why??)
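One way to see the last step (a short derivation, not spelled out on the slide): plug the three estimates into Bayes' rule.

    P(y|x) = P(x|y) P(y) / P(x)
           ≈ [k_y / (N_y·V)] · [N_y / N] / [k / (N·V)]
           = k_y / k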
k-NN algorithm • Put the input object into the same class as the majority of its k nearest neighbors • Implementation – Compute the distance between the input and each training example – Sort in increasing order – Count the number of instances from each class amongst the k nearest neighbors • There is no k which is always optimal
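A minimal sketch of this algorithm in NumPy (function and variable names are illustrative; with k = 1 it reduces to the 1-NN classifier above).

    # Minimal k-NN classifier following the three implementation steps above.
    import numpy as np

    def knn_predict(x, X_train, y_train, k=3):
        dists = np.linalg.norm(X_train - x, axis=1)    # 1) distance to every training example
        nearest = np.argsort(dists)[:k]                # 2) sort, keep the k closest
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]               # 3) majority vote among the k neighbors

    # Toy usage with two classes
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(np.array([0.2, 0.1]), X_train, y_train, k=3))   # -> 0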
Some useful distances • Norm-p distance: ||x − y||_p = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p} • Mahalanobis distance: dist(x, y) = (x − y)^T Σ^{-1} (x − y), where Σ is the covariance matrix • Kernel-based distance: dist(x, y)^2 = K(x, x) + K(y, y) − 2 K(x, y) – WHY?
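A small sketch of these distances in NumPy (names are illustrative; the kernel-based distance takes any kernel function, e.g. the RBF kernel defined earlier).

    # Sketch: the three distances above, implemented with NumPy.
    import numpy as np

    def norm_p_distance(x, y, p=2):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def mahalanobis_distance(x, y, cov):
        d = x - y
        return float(d @ np.linalg.inv(cov) @ d)       # (x - y)^T Sigma^{-1} (x - y)

    def kernel_distance_sq(x, y, K):
        # squared distance in the feature space induced by the kernel K
        return K(x, x) + K(y, y) - 2.0 * K(x, y)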
Generative approach (1) • Solving a classification problem = building P(class|input) • P(class_i|input) = P(input|class_i) P(class_i) / P(input) • P(input) = Σ_i P(input|class_i) P(class_i) • P(class_i) = percentage of examples from class i in the training set • So solving the classification problem reduces to building P(input|class_i) • P(input|class_i) = likelihood of class i • P(class_i|input) = posterior probability of class i
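A tiny numeric illustration of these formulas (the numbers are invented for illustration only).

    # Toy posterior computation via Bayes' rule.
    import numpy as np

    likelihood = np.array([0.02, 0.10])       # P(input | class_i) for classes 1 and 2
    prior      = np.array([0.70, 0.30])       # P(class_i), fractions in the training set
    evidence   = np.sum(likelihood * prior)   # P(input) = sum_i P(input|class_i) P(class_i)
    posterior  = likelihood * prior / evidence
    print(posterior)                          # P(class_i | input); predict the argmax -> class 2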
Generative approach (2) • To build P(input|class_i) we usually make an assumption about – How the data from class i are distributed, e.g. Binomial, Gaussian, Mixture of Gaussians – How the data are generated, e.g. HMM • Example: Document classification – How is each input, i.e. a document, represented? – What is the likelihood model for these data?
Document classification (1) • Spam/not-spam • Document = set of words • Preprocessing – word segmentation – remove stop-words – stemming – word selection
Document classification (2) • Naïve Bayes assumption: all words are independent given the class • P(w_1,...,w_n|Spam) = Π_i P(w_i|Spam) • Same hypothesis for all classes • How to compute P(w_i|Spam), and why? • What is the process of building a Naïve Bayes model for spam classification?
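A minimal sketch of that process (not from the slides; the vocabulary and documents are invented, and Laplace smoothing is an added assumption to avoid zero probabilities).

    # Sketch: word-level Naive Bayes for spam / not-spam with Laplace smoothing.
    import numpy as np

    vocab = ["viagra", "meeting", "free", "report"]
    docs = [(["viagra", "free"], "spam"), (["free", "free", "viagra"], "spam"),
            (["meeting", "report"], "ham"), (["report", "meeting", "free"], "ham")]

    classes = ["spam", "ham"]
    prior = {c: sum(1 for _, y in docs if y == c) / len(docs) for c in classes}
    counts = {c: {w: 1 for w in vocab} for c in classes}       # start at 1: Laplace smoothing
    for words, y in docs:
        for w in words:
            counts[y][w] += 1
    p_word = {c: {w: counts[c][w] / sum(counts[c].values()) for w in vocab} for c in classes}

    def classify(words):
        # log P(class) + sum_i log P(w_i | class), then take the argmax
        scores = {c: np.log(prior[c]) + sum(np.log(p_word[c][w]) for w in words if w in vocab)
                  for c in classes}
        return max(scores, key=scores.get)

    print(classify(["free", "viagra"]))     # -> "spam"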
Maximum Likelihood (1) • x_1,...,x_N are i.i.d. according to P(x|θ), where θ is the parameter of this model • Q: What is the proper value for θ? • A: The value which maximizes P(x_1,...,x_N|θ) • Q: We know P(x|θ); how do we compute P(x_1,...,x_N|θ)? • A: P(x_1,...,x_N|θ) = Π_i P(x_i|θ). Why? • Q: How to find the maximum value? • Q: How to get rid of the product?
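A short note on the last question (standard reasoning, not spelled out on the slide): because log is monotonically increasing, maximizing the likelihood is the same as maximizing the log-likelihood, which turns the product into a sum; the maximum is then found by setting the derivative with respect to θ to zero, or numerically when no closed form exists.

    log P(x_1,...,x_N | θ) = log Π_i P(x_i | θ) = Σ_i log P(x_i | θ)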
Maximum Likelihood (2) • Exercise: Binomial distribution – Q: What does this mean? What is P(w|Spam)? – word "viagra": {T, F, F, T, T, T, F, T, F, T} – Find the proper parameter for P(w|Spam)
Maximum Likelihood (3) • Exercise: coin-toss – {H, T, H, H, T, H, H, H, T, H} – Q: What is the parameter of Binomial distribution that fits this data? – Q: What is the conclusion?
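A numeric sketch for these two exercises (not part of the slides): score candidate values of θ by the likelihood of the observed sequence and pick the best; the grid search only illustrates that the empirical frequency maximizes the likelihood.

    # Sketch: maximum likelihood for a Bernoulli parameter, checked by grid search.
    import numpy as np

    data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])           # H = 1, T = 0 from the coin-toss exercise
    thetas = np.linspace(0.01, 0.99, 99)
    log_lik = data.sum() * np.log(thetas) + (len(data) - data.sum()) * np.log(1 - thetas)
    print("best theta:", thetas[np.argmax(log_lik)])           # equals the frequency of heads
    print("frequency of heads:", data.mean())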
Maximum A Posteriori • Sometimes we have prior knowledge about the model, i.e. we have P(θ) • We search for the maximum of P(θ|x_1,...,x_N) instead • Q: How to compute P(θ|x_1,...,x_N) from P(θ) and P(x|θ)? • Exercise: coin-toss problem, where θ is distributed as a Gaussian with mean 5/10 and standard deviation 1/10 – Q: What is the Gaussian model? – Q: What is the proper value for θ?
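A numeric sketch for this exercise (one way to set it up, not the slide's solution): by Bayes' rule P(θ|x_1,...,x_N) ∝ P(x_1,...,x_N|θ) P(θ), so we can add the log-prior to the log-likelihood and maximize over a grid of θ values.

    # Sketch: MAP estimate for the coin-toss exercise with a Gaussian prior on theta.
    import numpy as np

    data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
    thetas = np.linspace(0.01, 0.99, 99)
    log_lik   = data.sum() * np.log(thetas) + (len(data) - data.sum()) * np.log(1 - thetas)
    log_prior = -0.5 * ((thetas - 0.5) / 0.1) ** 2             # Gaussian prior: mean 0.5, sd 0.1
    print("ML  estimate:", thetas[np.argmax(log_lik)])
    print("MAP estimate:", thetas[np.argmax(log_lik + log_prior)])   # pulled toward the prior mean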
ML or MAP? • ML is good when we have enough data • MAP is preferred when we have little data • The prior can also be estimated from data