Teaching “Unstructured Information Management: Theory and Applications” to Computational Linguistics Students
Iryna Gurevych, Christof Müller, Torsten Zesch
Ubiquitous Knowledge Processing Group, Telecooperation, Computer Science Department, Darmstadt University of Technology
Typical NLP course
• Project topic
  - Yet another tokenizer
• Project results
  - Unstable software
  - Works only under special preconditions
  - Hard-coded configuration
    - “The software has to be installed in directory foo”
    - “The name of the input file has to be foobar”
Goals of our NLP course
• Teach basics in unstructured information management
• Separate software engineering from NLP
  - Provide a framework and preprocessing components
• Enable students to:
  - Concentrate on the computational linguistics part
  - Work on more challenging and motivating tasks
Using UIMA to reach these goals
Course outline
• Compact seminar
  - 6 sessions, 4 hours each
• Sessions
  1. Lecture
  2. UIMA basics
  3. Annotators (minimal example below)
  4. Consumers & Readers
  5. CPEs & PEAR packages
  6. Wrap up, Q&A
• Course requirements (MA level)
  - Participation
  - Implement a practical project
  - Deliver results as a PEAR package
  - Write a course paper
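To make the “Annotators” session concrete, a minimal UIMA analysis engine can look like the sketch below. This is not code from the course: the task (marking capitalized words) and the use of the generic built-in Annotation type are purely illustrative; a real project would define its own types in a type system descriptor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

/**
 * Minimal analysis engine: marks every capitalized word with a generic
 * Annotation. Illustrative only; real projects use JCas classes generated
 * from their own type system descriptors.
 */
public class CapitalizedWordAnnotator extends JCasAnnotator_ImplBase {

  private static final Pattern CAP_WORD = Pattern.compile("\\b\\p{Lu}\\p{L}*\\b");

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    Matcher m = CAP_WORD.matcher(jcas.getDocumentText());
    while (m.find()) {
      // Create an annotation over the matched span and add it to the CAS indexes
      Annotation a = new Annotation(jcas, m.start(), m.end());
      a.addToIndexes();
    }
  }
}
```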
Student projects
• Suitable tasks were defined in collaboration with the lecturers
• Selected projects:
  - Annotating Wikipedia articles
  - Extracting lexical semantic information from blogs
  - Named entity recognition
  - Sentiment detection
  - Word sense disambiguation
Annotating Wikipedia Articles
• Annotate structural elements in Wikipedia articles
  - Sections, paragraphs, lists, bold terms, ...
• Visualize the annotations
• A Wikipedia API is provided to retrieve articles
• Pipeline: Wikipedia article reader (UIMA reader, sketch below) → structural elements annotator (UIMA analysis engine) → visualizer (UIMA consumer)
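A collection reader for this pipeline could be sketched as follows. The actual Wikipedia API provided in the course is not shown here; fetchArticleText() is a placeholder, and the hard-coded article list stands in for a configuration parameter.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

/** Sketch of a reader that feeds one Wikipedia article per CAS. */
public class WikipediaArticleReader extends CollectionReader_ImplBase {

  // Hypothetical fixed article list; a real reader would take this as a parameter.
  private final List<String> titles = Arrays.asList("Darmstadt", "UIMA");
  private int current = 0;

  public boolean hasNext() {
    return current < titles.size();
  }

  public void getNext(CAS cas) throws IOException, CollectionException {
    String title = titles.get(current++);
    cas.setDocumentText(fetchArticleText(title));
  }

  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(current, titles.size(), Progress.ENTITIES) };
  }

  public void close() throws IOException {
    // nothing to release in this sketch
  }

  // Placeholder for the Wikipedia API call provided in the course.
  private String fetchArticleText(String title) {
    return "== " + title + " ==\nArticle text would go here.";
  }
}
```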
Lexical Semantic Information from Blogs
• Analyze blogs
• Find keywords
• Detect semantic relations between keywords
• Desired output: [example shown as a diagram in the original slides]
Lexical Semantic Information from Blogs
• UIMA components as proposed by the students: [pipeline diagram in the original slides]
Named Entity Recognition
• Hybrid approach: rules + gazetteers
• Preprocessing components were provided
• GermaNet and Wikipedia are accessed as UIMA resources (resource sketch below)
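UIMA external resources let several annotators share data such as gazetteer lists. The sketch below shows the general mechanism only: the gazetteer file format, the resource key "Gazetteer" (which would have to be bound in the analysis engine descriptor), and the lookup annotator are assumptions, not the actual GermaNet or Wikipedia wrappers used by the students.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.DataResource;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.SharedResourceObject;

/** Shared resource that loads a gazetteer with one entry per line. */
public class GazetteerResource implements SharedResourceObject {

  private final Set<String> entries = new HashSet<String>();

  public void load(DataResource data) throws ResourceInitializationException {
    try {
      BufferedReader in = new BufferedReader(
          new InputStreamReader(data.getInputStream(), StandardCharsets.UTF_8));
      String line;
      while ((line = in.readLine()) != null) {
        entries.add(line.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    }
  }

  public boolean contains(String candidate) {
    return entries.contains(candidate);
  }
}

/** Annotator that looks up the resource bound under the key "Gazetteer". */
class GazetteerLookupAnnotator extends JCasAnnotator_ImplBase {

  private GazetteerResource gazetteer;

  @Override
  public void initialize(UimaContext context) throws ResourceInitializationException {
    super.initialize(context);
    try {
      gazetteer = (GazetteerResource) context.getResourceObject("Gazetteer");
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // A real NER component would tokenize properly and create named entity
    // annotations; this loop only demonstrates the resource lookup.
    for (String word : jcas.getDocumentText().split("\\s+")) {
      if (gazetteer.contains(word)) {
        // create a named entity annotation here (type omitted in this sketch)
      }
    }
  }
}
```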
Sentiment Detection
• Detect sentiment expressions and link them to the entity being judged
• Preprocessing components were provided
• A robust NER component is required, but was not yet available for UIMA
• The GATE-UIMA interoperability layer was used to integrate the ANNIE tool
• Pipeline: text input reader (UIMA reader) → ANNIE NER via the UIMA-GATE / GATE-UIMA wrappers (GATE component) → sentiment detector (UIMA analysis engine) → result writer (UIMA consumer, sketch below)
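The result writer at the end of such a pipeline is a CAS consumer. The following minimal sketch is not the students' component: it simply dumps the type and covered text of every annotation to standard output, whereas a real writer would filter for sentiment and entity types and write to a file.

```java
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceProcessException;

/** Minimal CAS consumer: prints every annotation's type and covered text. */
public class ConsoleResultWriter extends CasConsumer_ImplBase {

  public void processCas(CAS cas) throws ResourceProcessException {
    for (AnnotationFS annotation : cas.getAnnotationIndex()) {
      System.out.println(annotation.getType().getShortName() + "\t"
          + annotation.getCoveredText());
    }
  }
}
```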
Word Sense Disambiguation
• Implements the WSD approach of Patwardhan and Pedersen (2006) (simplified sketch below)
• The necessary word glosses are generated using GermaNet
• GermaNet is accessed as a UIMA resource
• Preprocessing components were provided
• Pipeline: text input reader (UIMA reader) → provided preprocessing components + WSD (UIMA analysis engines) → result writer (UIMA consumer)
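Patwardhan and Pedersen (2006) disambiguate a target word by comparing a vector built from its context with vectors built from the glosses of its candidate senses, choosing the most similar sense. The sketch below is a heavy simplification for illustration only: it uses plain bag-of-words vectors and cosine similarity instead of second-order co-occurrence vectors, and leaves out the GermaNet-based gloss generation and all UIMA plumbing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Simplified gloss-vector style sense selection: pick the sense whose gloss
 * is most similar (cosine over bag-of-words vectors) to the context of the
 * target word.
 */
public class GlossVectorWsd {

  /** Returns the index of the best-matching sense, or -1 if no overlap at all. */
  public static int disambiguate(List<String> contextWords, List<List<String>> senseGlosses) {
    Map<String, Integer> contextVector = toVector(contextWords);
    int bestSense = -1;
    double bestScore = 0.0;
    for (int i = 0; i < senseGlosses.size(); i++) {
      double score = cosine(contextVector, toVector(senseGlosses.get(i)));
      if (score > bestScore) {
        bestScore = score;
        bestSense = i;
      }
    }
    return bestSense;
  }

  // Build a term-frequency vector from a list of words.
  private static Map<String, Integer> toVector(List<String> words) {
    Map<String, Integer> vector = new HashMap<String, Integer>();
    for (String w : words) {
      String key = w.toLowerCase();
      vector.put(key, vector.containsKey(key) ? vector.get(key) + 1 : 1);
    }
    return vector;
  }

  // Cosine similarity between two sparse term-frequency vectors.
  private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
    }
    for (int v : b.values()) {
      normB += v * v;
    }
    return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}
```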
Lessons Learned
• Advantages of using UIMA
  - Provides the necessary preprocessing tools
  - Enables more challenging and motivating tasks
  - Uniform structure of project results (PEAR package)
  - Students can concentrate on their core competences
  - Focus is on modeling rather than programming
• Challenges
  - Complexity of the UIMA architecture
  - Motivating students
• Possible solution
  - Provide a preconfigured work environment (vs. having students learn UIMA from scratch)
Thank you very much!
http://www.ukp.tu-darmstadt.de/
• Acknowledgments:
  - Prof. Erhard Hinrichs for his idea to offer the course
  - The participating ISCL students: Jonathan Khoo, Niels Ott, Sladjana Pavlovic, Maria Tchalakova, Bela Usabaev, Desislava Zhekova, Ramon Ziai