
LOC-DB Reference Extraction



  1. LOC-DB Reference Extraction
     DR. DR.-ING SHERAZ AHMED, SYED TAHSEEN RAZA RIZVI

  2. LOC-DB Architecture

  3. LOC-DB OCR Component
     Types of input files:
     ◦ Digital Born PDF
     ◦ Scanned Documents
     ◦ XML/HTML
     [Figure labels: XML File, Scanned Document, Textual PDF]

  4. Reference Extraction from: Scanned Documents

  5. Scanned Documents: Reference Extraction
     ◦ Step 1: Binarization
     ◦ Greyscale (0-255) / color to binary (0-1)
     [Figure: RGB image and the resulting binary image]

  6. Scanned Documents: Reference Extraction
     ◦ Step 2: Image Classification
     ◦ Single/double column documents
     [Figure: a single column document and a double column document]

  7. Scanned Documents: Reference Extraction
     ◦ Step 3: OCR (Optical Character Recognition)
     [Figure: OCR result]

  8. Scanned Documents: Reference Extraction
     ◦ Step 4: Reference Segmentation
     ◦ Using ParsCit

  9. Reference Extraction from: Textual / Digital Born PDFs

  10. Digital Born PDFs: Reference Extraction
      ◦ Step 1: Text Extraction
      [Figure: textual PDF and the extracted text]

  11. Digital Born PDFs: Reference Extraction
      ◦ Step 2: Reference Extraction
      ◦ Using ParsCit

  12. Reference Extraction from: Structured XML

  13. Structured XML: Reference Extraction
      ◦ Step 1: Preprocessing

  14. Structured XML: Reference Extraction
      ◦ Step 2: Reference Extraction
      ◦ Using ParsCit

  15. Reference Extraction Pipeline - Overview
      ◦ Scanned Documents → Binarization → Image Classification → OCR → Reference Segmentation
      ◦ Textual PDFs → Text Extraction → Reference Segmentation
      ◦ Structured XML → Pre-Processing → Reference Segmentation

  16. DeepBibX: A Neural Network based approach

  17. DeepBibX: Intuition

  18. Neural Network Based Approach

  19. Comparison with ParsCit
      [Figure: ParsCit output and DeepBibX output]

  20. Comparison with ParsCit
      On a test set of 286 bibliographic document images:
      ◦ Total: 5090 references
      ◦ ParsCit extracted: 3645 references
      ◦ Proposed approach: 4323 references
      [Chart "Extraction Comparison": total references vs. total detections for ParsCit and the FCN based approach]

  21. Thank you
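
Slide 5 describes binarization as mapping greyscale (0-255) or color pixels to binary (0-1) values. The slides do not say which thresholding method LOC-DB uses, so the following is only a minimal sketch that assumes a global Otsu threshold implemented with NumPy. The function names `otsu_threshold` and `binarize` are hypothetical, and color images are reduced to grey by a plain channel average.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold t of an 8-bit greyscale image.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)

    w0 = np.cumsum(hist)                 # pixel count of the class <= t
    w1 = hist.sum() - w0                 # pixel count of the class > t
    cum_mean = np.cumsum(hist * levels)  # running sum of intensity * count
    total_mean = cum_mean[-1]

    # Class means are only defined where both classes are non-empty.
    valid = (w0 > 0) & (w1 > 0)
    mu0 = np.zeros(256)
    mu1 = np.zeros(256)
    mu0[valid] = cum_mean[valid] / w0[valid]
    mu1[valid] = (total_mean - cum_mean[valid]) / w1[valid]

    # Otsu picks the t that maximizes the between-class variance.
    between = w0 * w1 * (mu0 - mu1) ** 2
    between[~valid] = -1.0
    return int(np.argmax(between))


def binarize(image: np.ndarray) -> np.ndarray:
    """Map a greyscale (H, W) or color (H, W, 3) image to a 0/1 array."""
    if image.ndim == 3:
        # Assumption: a simple channel average is good enough for a sketch.
        gray = image[..., :3].mean(axis=2).astype(np.uint8)
    else:
        gray = image.astype(np.uint8)
    return (gray > otsu_threshold(gray)).astype(np.uint8)


if __name__ == "__main__":
    page = np.full((4, 6), 220, dtype=np.uint8)  # light "paper"
    page[1:3, 2:4] = 30                          # dark "ink"
    print(binarize(page))
```

In this sketch, dark ink becomes 0 and light paper becomes 1. A real scan with uneven lighting would probably need a local (adaptive) threshold instead of a single global one.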
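
The overview on slide 15 routes each input type through its own preparation stages before reference segmentation. One way to picture that routing is the sketch below. The stage identifiers, the `STAGES` table, and the `plan` and `run` helpers are all hypothetical names chosen for illustration, and the handlers are placeholders rather than the actual LOC-DB components.

```python
from typing import Any, Callable, Dict, List

# Stage order per input type, following the slides (names are illustrative).
STAGES: Dict[str, List[str]] = {
    "scanned": ["binarization", "image_classification", "ocr",
                "reference_segmentation"],
    "textual_pdf": ["text_extraction", "reference_segmentation"],
    "structured_xml": ["preprocessing", "reference_segmentation"],
}


def plan(input_type: str) -> List[str]:
    """Return the ordered list of stages for one input type."""
    try:
        return list(STAGES[input_type])
    except KeyError:
        raise ValueError(f"unknown input type: {input_type!r}") from None


def run(input_type: str, payload: Any,
        handlers: Dict[str, Callable[[Any], Any]]) -> Any:
    """Feed the payload through each stage's handler in order."""
    for stage in plan(input_type):
        payload = handlers[stage](payload)
    return payload


if __name__ == "__main__":
    # Placeholder handlers that only record which stage ran.
    all_stages = {s for stages in STAGES.values() for s in stages}
    handlers = {s: (lambda p, s=s: p + [s]) for s in all_stages}
    for kind in STAGES:
        print(kind, "->", run(kind, [], handlers))
```

Keeping the stage order in a table means a new input type only needs a new entry, while the shared final stage stays in one place.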
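
The counts on slide 20 can be turned into extraction rates with a little arithmetic. The sketch below uses only the three numbers reported on the slide. It assumes every counted detection corresponds to a genuine reference, which the slide does not state, so the percentages are best read as upper bounds on recall.

```python
TOTAL_REFERENCES = 5090  # references in the 286 test images (slide 20)

detections = {
    "ParsCit": 3645,
    "FCN based approach": 4323,
}


def extraction_rate(found: int, total: int) -> float:
    """Fraction of the reference total that a method extracted."""
    return found / total


for name, found in detections.items():
    rate = extraction_rate(found, TOTAL_REFERENCES)
    print(f"{name}: {found}/{TOTAL_REFERENCES} = {rate:.1%}")
# ParsCit: 3645/5090 = 71.6%
# FCN based approach: 4323/5090 = 84.9%

gain = detections["FCN based approach"] - detections["ParsCit"]
print(f"Additional references extracted: {gain}")
# Additional references extracted: 678
```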
