Classifiers that improve with use
George Nagy
DocLab, Rensselaer Polytechnic Institute
IEICE-PRMU, February 19-20, 2004

Argument
In-house training sets are never large enough, and never representative enough. We must therefore augment them with samples from actual (real-time, real-world) OCR operation. We present some methods to this end.

Outline
- Non-representative training sets
- Supervised learning (continuing classifier education)
- "Unsupervised" adaptation: self-corrective, decision-directed, auto-label
- Symbolic Indirect Correlation (SIC) (new)
- Style-constrained classification
- Weakly-constrained data distributions (new)
- Linguistic context
- Recommendations

Representation
[Figure: samples of two classes (X and O) in a feature space of two features (x1, x2), with equiprobability contours and the decision boundary between the classes.]

Traditional open-loop OCR system
[Block diagram: training patterns and labels, a test set, and meta-parameters (e.g. regularization, estimators) drive parameter estimation; the resulting classifier parameters let the CLASSIFIER map operational data (bitmaps) to a transcript, with rejects routed to correction and reject entry.]

How representative is the training set?
(1) representative
(2) adaptable
(3) discrete styles (long fields)
(4) continuous styles (short fields)
(5) weakly constrained
Supervised learning
Generic OCR system that makes use of post-processed rejects and errors
[Block diagram: as in the open-loop system, but keyboarded labels of rejects and errors are fed back into the training set for renewed parameter estimation.]

Some classifiers
- Gaussian linear / quadratic Bayes
- Multilayer neural network
- Simple perceptron
- Nearest neighbor
- Support vector machine

Adaptation (DHS: "decision-directed approximation")
Field estimation, singlet classification
[Block diagram: classifier-assigned labels of accepted operational data are fed back into the training set for renewed parameter estimation.]

Self-corrective recognition (1966)
[Block diagram: the SCANNER reads the SOURCE DOCUMENT; the FEATURE EXTRACTOR feeds the CATEGORIZER, which classifies against the INITIAL REFERENCES; accepted patterns go to the REFERENCE GENERATOR, which produces NEW REFERENCES; rejected patterns are set aside.]

Decision-directed adaptation
aka self-corrective recognition, auto-label adaptation, semi-supervised learning, ...
[Figure: an omnifont classifier adapting to the easily confused digits 1 and 7 of a single font.]

Results: self-corrective recognition (Shelton & Nagy 1966)
- Training set: 9 fonts, 500 characters/font, U/C
- Test set: 12 fonts, 1500 characters/font, U/C
- 96 n-tuple features, ternary reference vectors adapted to a single font
- Initial error and reject rates: 3.5% / 15.2%
- After self-correction: 0.7% / 3.7%
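The decision-directed ("self-corrective") loop above can be sketched as a classifier that relabels its own confident decisions and re-estimates its parameters from them. This is a minimal hypothetical illustration using a nearest-mean classifier with an invented acceptance radius, not the 1966 n-tuple system; all names and thresholds are assumptions.

```python
import numpy as np

def decision_directed_pass(X, means, accept_radius):
    """One adaptation pass: classify every operational sample, accept only
    samples close to their nearest class mean, then re-estimate each mean
    from the samples the classifier itself accepted (self-labeling)."""
    labels = np.array([np.argmin(np.linalg.norm(means - x, axis=1)) for x in X])
    dists = np.array([np.linalg.norm(means[l] - x) for x, l in zip(X, labels)])
    accepted = dists < accept_radius           # confident decisions only
    new_means = means.copy()
    for c in range(len(means)):
        keep = accepted & (labels == c)
        if keep.any():                         # adapt mean toward accepted data
            new_means[c] = X[keep].mean(axis=0)
    return new_means, labels, accepted
```

Iterating this pass lets an "omnifont" starting point drift toward the statistics of the single font (or writer) actually being read, which is the effect reported on the results slide.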
Results: adapting both means and variances (Harsha Veeramachaneni 2003)
NIST hand-printed digit classes, with 50 "Hitachi features"

Train     Test  % Error before  Adapt means  Adapt means & variances
SD3       SD3   1.1             0.7          0.6
SD3       SD7   5.0             2.6          2.2
SD7       SD3   1.7             0.9          0.8
SD7       SD7   2.4             1.6          1.7
SD3+SD7   SD3   0.9             0.6          0.6
SD3+SD7   SD7   3.2             1.9          1.8

Results: Baird & Nagy (DR&R 1994)
100 fonts, 80 symbols each from Baird's defect model (6,400,000 characters)

Size (pt)  Error %  Fonts improved  Best reduction  Worst reduction
6          1.4      100             4x              1.0x
10         2.5      93              11x             0.8x
12         4.4      98              34x             0.9x
16         7.2      98              141x            0.8x

InkLink (Adnan El-Nasan, 2003)
On-line handwriting recognition.
Constrained localized polygram matching: one unknown word is matched against many reference words, using a lexicon of legal words. The reference set does not include most of the lexicon words!

From electronic ink to feature string
[Figure: conversion of an on-line handwritten word into a string of stroke features.]

Polygram feature match / Feature matching
Unknown query word: "founding"
(tX8j5XnNeEWXwBXEeNnWwSsXXTwSsXnTRwSsnTBnNewXsXXeNnWwSsX!tWwSsXTwSs)(#$%)(nNewBnNewSsXEeNnWwSNeEj65)
A reference word: "amendment"
(XLXsEeNnWwSsXTwSsnTwBnNeXBTewB)(XWBETnWwSsnTXBnNewS)(EXNnWBsXt)(nNeXEsSwWnNeEBsnTwSXETnWwSsNewBnNeBsXF!tFwSs)(LwS)
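The polygram matching of two feature strings, as on the slides above, can be approximated by counting the length-n substrings the strings share. This is a deliberately crude sketch of the idea, not InkLink's constrained localized matcher; the function names and the choice of n are assumptions.

```python
def shared_polygrams(query, reference, n=4):
    """Return the length-n substrings (polygrams) that the query and
    reference feature strings have in common."""
    q_grams = {query[i:i + n] for i in range(len(query) - n + 1)}
    r_grams = {reference[i:i + n] for i in range(len(reference) - n + 1)}
    return q_grams & r_grams

def match_score(query, reference, n=4):
    """Normalized overlap: fraction of the query's polygrams that also
    occur somewhere in the reference string."""
    shared = shared_polygrams(query, reference, n)
    total = max(len(query) - n + 1, 1)
    return len(shared) / total
```

A query whose hypothesized identity is correct should share many long, well-placed polygrams with the reference strings, which is what the "good match" / "poor match" slides illustrate.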
Query hypothesized as "founding": good match
Query hypothesized as "contract": poor match
[Figures: the query and reference feature strings shown above, with matching polygrams marked; the correct hypothesis yields a good match, the wrong one a poor match.]

Localized Viterbi trellis search
[Figure: trellis with the reference transcript "amendments" along one axis and the hypothesized word "contract" along the other.]

InkLink classification algorithm
1. The expected location where the unknown matches each reference word is pre-computed.
2. The feature matches of the unknown against the reference words are found by string matching.
3. The hypothesis that corresponds best to the expected length and location of the matches is chosen.

Our most/least favorite writers
(four writers we like)
[Figure: handwriting samples.]

Comparison with external system
100-word lexicons
[Figure: comparative results.]
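Step 3 of the classification algorithm above, choosing the hypothesis whose matches best agree with the expected location, might be sketched as follows. Here lexicon words stand in for feature strings, and the reference word, offsets, and function names are all invented for illustration; the real system works on the feature-string level with a trellis search.

```python
def expected_offset(hypothesis, ref_word):
    """Where ref_word should start within the hypothesized word,
    as a fraction of the hypothesis length (None if it cannot occur)."""
    pos = hypothesis.find(ref_word)
    if pos < 0:
        return None
    return pos / len(hypothesis)

def best_hypothesis(lexicon, ref_word, observed_offset):
    """Pick the lexicon word whose expected match location is closest to
    the match location actually observed in the unknown."""
    best, best_err = None, float("inf")
    for word in lexicon:
        exp = expected_offset(word, ref_word)
        if exp is None:
            continue
        err = abs(exp - observed_offset)
        if err < best_err:
            best, best_err = word, err
    return best
```

The key point the slides make is that position matters: a match of the right content in the wrong place is evidence against a hypothesis.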
Self-corrective recognition (1966) / Auto-label adaptation
[Block diagram, repeated: SCANNER, FEATURE EXTRACTOR, CATEGORIZER with INITIAL and NEW REFERENCES, REFERENCE GENERATOR; accepted patterns regenerate the references, rejected patterns are set aside.]

Results of adaptation ("auto-label") in InkLink
[Plot: error rate versus iteration number (0 to 5), falling from about 28% to about 7%.]
- Error rate dropped from 28% to 7%.
- As good with 100 reference words as with 500 reference words without adaptation.

Outline (resumed)
- Non-representative training sets
- Supervised learning (continuing classifier education)
- "Unsupervised" adaptation: self-corrective, decision-directed, auto-label
- Symbolic Indirect Correlation (SIC)
- Style-constrained classification
- Weakly-constrained data distributions
- Linguistic context
- Recommendations

Symbolic Indirect Correlation (SIC)
Match graphs
Signal graph of "lever" compared to the reference signal graph
[Figures: a SIGNAL GRAPH matching the unknown word "~LEVER~" against the reference "~PERIOD~EVER~PEOPLE~", and LEXICAL GRAPHS matching the lexicon words "~LEVER~" and "~PERPLEX~" against the same reference.]
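The indirect correlation idea in SIC can be caricatured as follows: instead of comparing the unknown to lexicon words directly, compare the *pattern* of its matches against a shared reference set with each lexicon word's pattern of matches against the same references. This toy sketch works on character strings with bigram overlap as the "match"; it is a loose analogy to the signal-versus-lexical match graphs of the slides, and every name here is an assumption.

```python
def match_profile(item, references, n=2):
    """Which references share at least one length-n gram with the item?
    Returns one boolean per reference."""
    grams = {item[i:i + n] for i in range(len(item) - n + 1)}
    profile = []
    for ref in references:
        ref_grams = {ref[i:i + n] for i in range(len(ref) - n + 1)}
        profile.append(bool(grams & ref_grams))
    return profile

def sic_classify(unknown_profile, lexicon, references):
    """Pick the lexicon word whose lexical match profile agrees best with
    the unknown's signal-domain match profile (indirect correlation)."""
    def agreement(word):
        p = match_profile(word, references)
        return sum(a == b for a, b in zip(p, unknown_profile))
    return max(lexicon, key=agreement)
```

Because the comparison is between match patterns rather than between samples and class templates, the lexicon can contain words for which no reference sample exists, which is the point emphasized on the InkLink and SIC slides.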