LOC-DB Reference Extraction
DR.-ING. SHERAZ AHMED, SYED TAHSEEN RAZA RIZVI
LOC-DB Architecture
LOC-DB OCR Component
Types of input files:
◦ Digital Born PDF
◦ Scanned Documents
◦ XML/HTML
[Figure: example inputs: XML file, scanned document, textual PDF]
Reference Extraction from: Scanned Documents
Scanned Documents: Reference Extraction
◦ Step 1: Binarization
  ◦ Greyscale (0-255) or color to binary (0-1)
[Figure: RGB image and the resulting binary image]
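
The binarization step can be illustrated with a minimal sketch. The slides do not name a thresholding method, so Otsu's global threshold is an assumption here, as are the function names and the convention that 1 means light background and 0 means dark ink. Images are plain lists of rows to keep the example dependency-free.

```python
def rgb_to_grey(r, g, b):
    """Convert one RGB pixel to a grey level (0-255) using BT.601 luma weights."""
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def otsu_threshold(pixels):
    """Return the grey level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(grey_image):
    """Map a greyscale image (rows of 0-255 ints) to 0 (ink) / 1 (background)."""
    flat = [p for row in grey_image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p > t else 0 for p in row] for row in grey_image]


if __name__ == "__main__":
    page = [[10, 12, 200], [11, 210, 205]]
    print(binarize(page))
```

A global threshold is the simplest choice; unevenly lit scans often need a local (adaptive) threshold instead.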
Scanned Documents: Reference Extraction
◦ Step 2: Image Classification
  ◦ Single/Double Column Documents
[Figure: examples of a single column document and a double column document]
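
The slides do not say how single and double column pages are told apart. As a rough illustration only, and not the authors' classifier, the sketch below uses a hand-written projection-profile heuristic: it looks for a vertical strip of blank columns near the middle of a binarized page with ink on both sides. The function name, the parameters `min_gutter` and `max_ink_ratio`, and the 0 = ink / 1 = background convention are all assumptions.

```python
def classify_columns(binary, min_gutter=2, max_ink_ratio=0.02):
    """Return "double" if a blank vertical gutter splits the page, else "single".

    binary: list of rows, each a list of 0 (ink) or 1 (background).
    """
    height = len(binary)
    width = len(binary[0])

    # Vertical projection profile: number of ink pixels in each pixel column.
    ink = [sum(1 for row in binary if row[x] == 0) for x in range(width)]
    blank = [count <= max_ink_ratio * height for count in ink]

    # Only search the middle third of the page for a gutter.
    start, end = width // 3, 2 * width // 3
    run = 0
    for x in range(start, end):
        if blank[x]:
            run += 1
            has_ink_left = any(ink[: x - run + 1])
            has_ink_right = any(ink[x + 1:])
            if run >= min_gutter and has_ink_left and has_ink_right:
                return "double"
        else:
            run = 0
    return "single"


if __name__ == "__main__":
    two_columns = [[1] + [0] * 4 + [1] * 2 + [0] * 4 + [1] for _ in range(4)]
    one_column = [[1] + [0] * 10 + [1] for _ in range(4)]
    print(classify_columns(two_columns), classify_columns(one_column))
```

The layout label matters for the next step because OCR on a double column page needs to read each column separately rather than across the full page width.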
Scanned Documents: Reference Extraction
◦ Step 3: OCR (Optical Character Recognition)
[Figure: OCR result]
Scanned Documents: Reference Extraction
◦ Step 4: Reference Segmentation
  ◦ Using ParsCit
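
ParsCit is an external tool that relies on a trained sequence model, so it is not reproduced here. To show what "reference segmentation" means in practice, the following sketch is a much simpler stand-in: it splits OCR text into individual reference strings wherever a line starts with a numeric marker such as "[3]" or "3.". The marker pattern and function name are assumptions for illustration, and unnumbered reference lists would need a different approach.

```python
import re

# A new reference is assumed to start at a line beginning with "[12]" or "12."
MARKER = re.compile(r"^(?:\[\d+\]|\d+\.)\s+")


def segment_references(ocr_text):
    """Split OCR output into one string per reference, joining wrapped lines."""
    references = []
    current = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if MARKER.match(line):
            if current:
                references.append(" ".join(current))
            # Drop the marker itself and start collecting a new reference.
            current = [MARKER.sub("", line, count=1)]
        elif current:
            # Continuation line of the reference currently being collected.
            current.append(line)
    if current:
        references.append(" ".join(current))
    return references


if __name__ == "__main__":
    text = (
        "[1] A. Author. First title.\n"
        "    Journal, 2001.\n"
        "[2] B. Writer. Second title. 2005.\n"
    )
    for ref in segment_references(text):
        print(ref)
```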
Reference Extraction from: Textual / Digital Born PDFs
Digital Born PDFs: Reference Extraction
◦ Step 1: Text Extraction
[Figure: textual PDF and the extracted text]
Digital Born PDFs: Reference Extraction
◦ Step 2: Reference Extraction
  ◦ Using ParsCit
Reference Extraction from: Structured XML
Structured XML: Reference Extraction
◦ Step 1: Preprocessing
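
The slide does not spell out what preprocessing involves. One plausible reading, assumed here, is flattening the markup of each reference into plain text so that it can be passed to a text-based extractor. The sketch uses Python's standard `xml.etree.ElementTree`; the element name `ref` and the function name are hypothetical and would depend on the actual XML schema.

```python
import xml.etree.ElementTree as ET


def references_from_xml(xml_string, tag="ref"):
    """Return the plain text of every <tag> element, with whitespace normalized."""
    root = ET.fromstring(xml_string)
    references = []
    for element in root.iter(tag):
        # itertext() yields all text inside the element, including nested tags.
        raw = "".join(element.itertext())
        references.append(" ".join(raw.split()))
    return references


if __name__ == "__main__":
    sample = (
        "<article><back>"
        "<ref><author>A. Author</author>. <title>First title</title>. 2001.</ref>"
        "<ref>\n<author>B. Writer</author>\n<title>Second title</title>\n</ref>"
        "</back></article>"
    )
    for ref in references_from_xml(sample):
        print(ref)
```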
Structured XML: Reference Extraction
◦ Step 2: Reference Extraction
  ◦ Using ParsCit
Reference Extraction Pipeline - Overview
◦ Scanned Documents: Binarization → Image Classification → OCR → Reference Segmentation
◦ Textual PDFs: Text Extraction → Reference Segmentation
◦ Structured XML: Pre-Processing → Reference Segmentation
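
The three input paths share a final reference segmentation stage and differ only in what precedes it. A minimal sketch of that dispatch is shown below, assuming the input type is already known. The stage names come from the slides, while the dictionary keys, function names, and the idea of passing stage implementations in as callables are illustrative assumptions.

```python
# Ordered stages per input type, following the overview above.
PIPELINES = {
    "scanned": ["binarization", "image_classification", "ocr", "reference_segmentation"],
    "pdf": ["text_extraction", "reference_segmentation"],
    "xml": ["preprocessing", "reference_segmentation"],
}


def plan(input_type):
    """Return the ordered list of stage names for an input type."""
    try:
        return PIPELINES[input_type]
    except KeyError:
        raise ValueError(f"unsupported input type: {input_type!r}") from None


def run_pipeline(input_type, data, stage_functions):
    """Apply each stage in order; stage_functions maps stage name to a callable."""
    for name in plan(input_type):
        data = stage_functions[name](data)
    return data


if __name__ == "__main__":
    # Placeholder stages, only to show the control flow.
    stages = {
        "text_extraction": lambda pdf_bytes: pdf_bytes.decode("utf-8"),
        "reference_segmentation": lambda text: text.splitlines(),
    }
    print(run_pipeline("pdf", b"Ref one\nRef two", stages))
```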
DeepBibX: A Neural Network Based Approach
DeepBibX: Intuition
Neural Network Based Approach
Comparison with ParsCit
[Figure: ParsCit output and DeepBibX output]
Comparison with ParsCit
On a test set of 286 bibliographic document images:
◦ Total: 5090 references
◦ ParsCit extracted: 3645 references
◦ Proposed approach: 4323 references
[Chart: "Extraction Comparison", total references vs. total detections for ParsCit and the FCN based approach]
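
The counts above are easier to compare as rates. The short calculation below divides each detection count by the 5090 total references. It assumes every detection corresponds to a real reference, which the slides do not state, so the result is a share of references detected rather than a verified recall figure. Variable and function names are illustrative.

```python
TOTAL_REFERENCES = 5090
DETECTIONS = {"ParsCit": 3645, "FCN based approach": 4323}


def extraction_rate(detected, total=TOTAL_REFERENCES):
    """Fraction of the total references that were detected."""
    return detected / total


for name, count in DETECTIONS.items():
    print(f"{name}: {count}/{TOTAL_REFERENCES} = {extraction_rate(count):.1%}")

# Additional references found by the proposed approach over ParsCit.
print("difference:", DETECTIONS["FCN based approach"] - DETECTIONS["ParsCit"])
```

This gives roughly 71.6% for ParsCit and 84.9% for the proposed approach, a difference of 678 references.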
Thank you