OCR for CJK Mark Ravina CEAL Technology Forum 2018
• I am an OCR end-user, not an OCR developer
• I am a passionate open-source advocate
• Commercial software needs to be worth the cost
Can we put high-quality CJK OCR on every computer in a library for little or no cost?
Options for Japanese OCR (product: cost, quality)
• Google Drive: free, poor
• Adobe Acrobat: expensive, adequate
• ABBYY FineReader: $199, very good
• eTypist: ¥19,800, very good
• Tesseract: free, very good
“Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. Like a supernova, it appeared from nowhere for the 1995 UNLV Annual Test of OCR Accuracy, shone brightly with its results, and then vanished back under the same cloak of secrecy under which it had been developed. Now for the first time, details of the architecture and algorithms can be revealed.”
Ray Smith, Google (2007)
Tesseract currently supports:
• Japanese
• Korean
• Chinese simplified
• Chinese traditional
• Vietnamese
• Uighur
• Lao
• Khmer
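On the command line, each of these languages is selected with a short traineddata code passed to the -l flag. The sketch below lists the codes as I understand them from the Tesseract language data files; the dictionary and the helper name lang_flag are illustrative and not part of Tesseract itself.

```python
# Traineddata codes passed to Tesseract's -l flag for the languages above.
# The mapping and the helper are illustrative, not part of Tesseract.
LANG_CODES = {
    "Japanese": "jpn",
    "Korean": "kor",
    "Chinese simplified": "chi_sim",
    "Chinese traditional": "chi_tra",
    "Vietnamese": "vie",
    "Uighur": "uig",
    "Lao": "lao",
    "Khmer": "khm",
}


def lang_flag(names):
    """Join one or more language names into a single -l value."""
    # Tesseract accepts several languages joined with "+".
    return "+".join(LANG_CODES[name] for name in names)
```

For a page that mixes Japanese and traditional Chinese, lang_flag(["Japanese", "Chinese traditional"]) gives the value to pass after -l.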
Challenges to deploying Tesseract compared to commercial OCR
• Command line interface
• Documentation is incomprehensible to most end-users
• Reads tiff but not pdf
• Lacks strong page segmentation, image cleaning, and pre-processing
• Lacks specialized features: e.g., furigana extraction
Command line interface
• Command structure is simple
• Example: tesseract sample.tiff output -l jpn pdf
• But many users are intimidated by ANY command line interface
• File locations can be cumbersome and confusing:
  /Volumes/Transcend/Documents_prime/Research papers/Meiji_petitions_research/v5/tiffs/combined
Replacements for command line
• Write a small script for end users
  • Automator on Mac OS X
  • Third-party options for Windows
• Drag and drop
• Add options or multiple icons for different layouts and languages?
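A small wrapper script along these lines might look like the following sketch. It assumes the tesseract executable is on the PATH and that Tesseract 4 or later is installed (older versions spell the layout option -psm). The function names build_command and run_ocr are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(image_path, lang="jpn", psm=None):
    """Build the tesseract call that writes a searchable PDF next to the image."""
    image = Path(image_path)
    # Tesseract appends ".pdf" to the output base name on its own.
    output_base = image.with_suffix("")
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if psm is not None:
        # Optional page segmentation mode, for different layouts.
        cmd += ["--psm", str(psm)]
    cmd.append("pdf")
    return cmd


def run_ocr(paths, lang="jpn", psm=None):
    """Run OCR on every file handed to the script, e.g. by drag and drop."""
    for path in paths:
        # Passing a list avoids quoting problems with spaces in long paths.
        subprocess.run(build_command(path, lang, psm), check=True)
```

An Automator action or a Windows drag-and-drop target could pass the dropped file paths to run_ocr. Separate icons for different layouts could call it with different lang and psm values, for example psm=5, which Tesseract documents as a single block of vertically aligned text.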
PDF to TIFF conversion
• Can be included in Automator on Mac OS X
• Free third-party websites
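The conversion can also be scripted. As one possibility not mentioned in the slides, the sketch below builds a call to pdftoppm from the open-source Poppler utilities, assuming they are installed. The helper name pdf_to_tiff_command and the 300 dpi default are illustrative choices.

```python
from pathlib import Path


def pdf_to_tiff_command(pdf_path, dpi=300):
    """Build a pdftoppm call that writes one greyscale TIFF per PDF page."""
    pdf = Path(pdf_path)
    # pdftoppm names its output files from this prefix plus a page number.
    prefix = pdf.with_suffix("")
    return ["pdftoppm", "-tiff", "-gray", "-r", str(dpi), str(pdf), str(prefix)]
```

The returned list can be handed to subprocess.run, and the resulting TIFF files can then be passed to Tesseract.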
Text segmentation problems: shift from vertical to horizontal
OCR will often put this header text in the middle of body text
Similar problems for page numbers . . .
OCR will often try to interpret shadows as text, with confusing results
Half line glosses make OCR crazy
Improving page segmentation and image processing
• Use OS X bundled tools
  • Contrast
  • Crop
  • Greyscale
• Pre-process with open source tools
  • ImageMagick (noise and skew)
  • OpenCV
• Don’t use Tesseract on dirty or complex images
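As a rough illustration of the ImageMagick route, the sketch below assembles a clean-up command covering greyscale, contrast, skew, and noise. It assumes ImageMagick is installed and reachable as convert (ImageMagick 7 installs name the program magick). The helper name cleanup_command and the 40% deskew threshold are assumptions, not settings from the talk.

```python
def cleanup_command(src, dst, deskew_threshold=40):
    """Build an ImageMagick call that prepares a scanned page for OCR."""
    return [
        "convert", src,
        "-colorspace", "Gray",                # greyscale
        "-normalize",                         # stretch contrast
        "-deskew", f"{deskew_threshold}%",    # straighten a tilted scan
        "-despeckle",                         # reduce speckle noise
        dst,
    ]
```

The list can be run with subprocess.run before the cleaned image is passed to Tesseract. Cropping away shadows, headers, and page numbers still has to be decided per scan.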
Recommend
More recommend