
Semantic PDF Segmentation for Legacy Documents in Technical Documentation



  1. Semantic PDF Segmentation for Legacy Documents in Technical Documentation Jan Oevermann jan.oevermann@dfki.de SEMANTiCS 2018, Vienna, 13.09.18

  2. Technical Documentation
     Most common: PDF documents
     • “Digital Paper”, archival & distribution
     • ISO Standard, guaranteed reproduction, ubiquitous support
     Best practice: XML content components
     • Self-contained building blocks, e.g. chapter-sized, ~150-500 words
     • Reuse, translation, aggregation, delivery
     13.09.18 Jan Oevermann (DFKI), SEMANTiCS 2018, Vienna

  3. Motivation
     [Figure: online portal with search; XML content components of types Description and Task; PDF documents]

  4. Motivation
     Example information requests
     • “Only safety information of the document”
     • “I need maintenance information about the fuel injection”
     • “Everything about the hydraulic pump in technical overview or technical data”
     Faceted search
     • Information requests use semantic concepts, which can serve as facets

  5. Motivation
     Limitations of PDF
     • Semantic structure gets lost
     • No metadata for (overlapping) segments
     • Large documents (>200p) only accessible via full text search
     Idea
     • Use knowledge from structured XML content components
     • Manually annotated semantic concepts / metadata
     • Apply trained model on text extracted from PDF
     • Find segments which are semantically relevant
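     The idea on this slide can be sketched roughly as follows: label each unit of extracted PDF text with a classifier, then merge neighbouring units that share a label into a segment. This is only an illustration under stated assumptions. The page as the unit of text, the function names `find_ranges` and `toy_classify`, the keyword rules, and the sample pages are all hypothetical and do not come from the talk; the keyword rules merely stand in for a model trained on XML content components.

```python
from typing import Callable, List, Tuple

def find_ranges(pages: List[str],
                classify: Callable[[str], str]) -> List[Tuple[str, int, int]]:
    """Label each page, then merge runs of equal labels.

    Returns (label, first_page, last_page) tuples with 1-based page numbers.
    Assumption: one page is one unit of text; the slides do not state the unit.
    """
    ranges: List[Tuple[str, int, int]] = []
    for page_no, text in enumerate(pages, start=1):
        label = classify(text)
        if ranges and ranges[-1][0] == label:
            # Same label as the previous page: extend the open range.
            ranges[-1] = (label, ranges[-1][1], page_no)
        else:
            # Label changed: start a new range at this page.
            ranges.append((label, page_no, page_no))
    return ranges

def toy_classify(text: str) -> str:
    """Hypothetical keyword rules standing in for a trained model."""
    lowered = text.lower()
    if "warning" in lowered or "danger" in lowered:
        return "safety"
    if "replace" in lowered or "check" in lowered:
        return "maintenance"
    return "other"

# Invented sample text, one string per extracted PDF page.
pages = [
    "Warning: hot surfaces.",
    "Danger of crushing. Wear gloves.",
    "Check the fuel injection every 500 hours.",
    "Replace the hydraulic pump filter.",
    "Technical data: pressure 210 bar.",
]
segments = find_ranges(pages, toy_classify)
print(segments)
```

     Because the result is a list of page ranges with labels, the PDF itself stays untouched and the ranges can be stored as additional metadata.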

  6. Procedure model

  7. Training / Classification
     Learning phase: Training data → Feature extraction (bag of n-grams) → Weighting (TF-ICF-CF) → Model (VSM)
     Classification: New data (unclassified) → Classifier (cosine similarity / k-nearest neighbour) → Prediction
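     A minimal sketch of such a pipeline, to make the two phases concrete: word n-grams as features, a weighted vector space model, and a nearest-neighbour decision by cosine similarity. The slide does not define TF-ICF-CF, so this sketch swaps in plain TF-IDF as a stand-in weighting; it is not the author's scheme. All function names, the n-gram length, and the training samples are assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Bag of word n-grams (lengths 1..n_max) as a Counter."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

def train(samples):
    """samples: list of (text, label). Returns (idf, weighted vectors).

    Weighting is smoothed TF-IDF, used here in place of TF-ICF-CF.
    """
    bags = [(ngrams(text), label) for text, label in samples]
    df = Counter()
    for bag, _ in bags:
        df.update(bag.keys())  # document frequency per n-gram
    n_docs = len(bags)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [({t: tf * idf[t] for t, tf in bag.items()}, label)
               for bag, label in bags]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(text, idf, vectors, k=1):
    """Majority label among the k training vectors most similar to text."""
    query = {t: tf * idf[t] for t, tf in ngrams(text).items() if t in idf}
    ranked = sorted(vectors, key=lambda v: cosine(query, v[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training samples standing in for annotated XML content components.
samples = [
    ("warning risk of injury wear protective gloves", "safety"),
    ("danger high voltage risk of electric shock", "safety"),
    ("replace the oil filter and check the oil level", "maintenance"),
    ("check the fuel injection pump and replace the seals", "maintenance"),
]
idf, vectors = train(samples)
label = predict("risk of injury from hot surfaces", idf, vectors)
print(label)
```

     N-grams unseen during training are dropped from the query vector, so text extracted from a PDF is compared only on vocabulary the model already knows.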

  8. Chunking

  9. Chunking / Classification

  10. Range finding

  11. Metadata generation https://iirds.org/

  12. Metadata generation

  13. Application: Live demo

  14. Results

  15. Outlook & Conclusion
     Outlook
     • Other text types (e.g. patents) or document types (e.g. Word)
     • Combination with other techniques (formatting / heuristics)
     Conclusion
     • Method relies on text and is formatting-independent
     • No splitting of PDF, just additional metadata
     • Good results in detecting semantic segments
     • Identified ranges can be provided in a standardized format

  16. Contact
     Jan Oevermann
     • jan.oevermann@dfki.de
     • www.janoevermann.de
     Code & Demo
     • github.com/j-oe/segments
     • segments.fastclass.de
