Darmstadt Knowledge Processing Repository Based on UIMA
Iryna Gurevych, Max Mühlhäuser, Christof Müller, Jürgen Steimle, Markus Weimer, Torsten Zesch
Ubiquitous Knowledge Processing Group, Telecooperation, Computer Science Department, Darmstadt University of Technology
[Figure: Darmstadt Knowledge Processing Software Repository (Ubiquitous Knowledge Processing) and the projects AQUA, SIR, and THESEUS]
Automatic Quality Assessment and Feedback in eLearning 2.0 (AQUA)
User Generated Discourse in Web 2.0
AQUA – Anoto pen
AQUA – System Architecture: Natural Language Processing, Machine Learning
AQUA – System Architecture
SIR (in cooperation with Prof. Hinrichs)
• Semantic Information Retrieval
• Bridge the human–computer gap
  - Natural language communication interface
  - Low-level expression of information need
• Semantic search (SIR) based on semantic relatedness
Information Retrieval (IR): Boolean, Vector Space, ...
[Figure: keywords are matched against documents (Document 1, Document 2, Document 3, ...)]
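The keyword matching on this slide can be illustrated with a minimal vector space sketch. The document texts, the function names, and the plain term-frequency weighting are assumptions made for illustration; they are not taken from the SIR system.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    """Return names of documents with non-zero similarity, best match first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), name) for name, text in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


# Hypothetical toy collection
DOCS = {
    "Document 1": "baking bread and cake",
    "Document 2": "programming a computer",
    "Document 3": "reading books",
}
```

With this toy collection, `rank("cake", DOCS)` finds Document 1, while `rank("pastry", DOCS)` finds nothing, because pure keyword matching has no notion of related words.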
SIR-Project: semantic search (SIR) based on semantic relatedness
[Figure: an essay is matched against profession descriptions (Profession 1, Profession 2, Profession 3, ...) via semantic relatedness; example terms: baker, to program, quality assurance; cake, computer, to read, ...]
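A minimal sketch of ranking professions by semantic relatedness to an essay. The relatedness scores, the assignment of the example terms to the essay and to the professions, and the scoring formula (average of the best match per essay term) are all hypothetical; a real system would compute relatedness from a knowledge source.

```python
# Hypothetical relatedness scores in [0, 1] for unordered term pairs
RELATEDNESS = {
    frozenset(("cake", "baker")): 0.9,
    frozenset(("computer", "to program")): 0.8,
    frozenset(("to read", "quality assurance")): 0.3,
}


def relatedness(a, b):
    """Look up the relatedness of two terms; identical terms score 1.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def score(essay_terms, profession_terms):
    """Average, over essay terms, of the best relatedness to any profession term."""
    if not essay_terms or not profession_terms:
        return 0.0
    best = [max(relatedness(e, p) for p in profession_terms) for e in essay_terms]
    return sum(best) / len(best)


def rank_professions(essay_terms, professions):
    """Return profession names sorted by descending score."""
    return sorted(professions, key=lambda name: -score(essay_terms, professions[name]))


ESSAY = ["cake", "computer", "to read"]
PROFESSIONS = {
    "Profession 1": ["baker"],
    "Profession 2": ["to program"],
    "Profession 3": ["quality assurance"],
}
```

Unlike keyword search, the essay does not need to contain the word "baker" for Profession 1 to rank first; the relatedness between "cake" and "baker" is enough.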
SIR Example
• Find good index terms: Compound Splitting, Negation Detection, WSD
• Compute semantic relatedness
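A rough sketch of how index terms might be derived, assuming a dictionary-based compound splitter and a very crude negation heuristic (skip the single token that follows a negation cue). The lexicon, the cue list, and both heuristics are assumptions for illustration and not the components used in SIR; WSD is omitted.

```python
# Hypothetical lexicon and negation cues
LEXICON = {"software", "entwickler", "soft", "ware"}
NEGATION_CUES = {"not", "no", "never"}


def split_compound(word, lexicon, min_len=3):
    """Split a word into lexicon entries, preferring the longest head.

    Returns an empty list if the word cannot be fully covered by the lexicon.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if rest:
                return [head] + rest
    return []


def index_terms(tokens, lexicon):
    """Collect index terms, splitting compounds and dropping negated tokens."""
    terms = []
    negated = False
    for token in tokens:
        low = token.lower()
        if low in NEGATION_CUES:
            negated = True
            continue
        if not negated:
            # Fall back to the unsplit token when no split is found
            terms.extend(split_compound(low, lexicon) or [low])
        negated = False
    return terms
```

For example, `index_terms(["Softwareentwickler", "not", "baking"], LEXICON)` yields the terms `software` and `entwickler` and drops the negated `baking`.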
THESEUS - TEXO
• Large-scale BMBF-Project, industry (SAP, Siemens, etc.)
• Service Marketplaces in Web 2.0
  - Find services, both users and machines
• Problem:
  - Only keyword-based search
  - Lack of ontologies for semantic search
• Solution:
  - Use natural language descriptions of web services
  - Apply Semantic Information Retrieval
  - Community Mining for optimized service selection
  - Darmstadt Knowledge Processing Repository
UIMA components (SIR, AQUA, THESEUS)
Data import: Wikipedia reader, Forum reader, Plain text reader
Linguistic preprocessing: Tokenizer, Sentence splitter, Stopword tagger
Morphological analysis: Stemmer, Lemmatizer, Compound Splitter
Syntactic analysis: PoS-Tagger, Parser
Semantic analysis: NE tagger, Sentiment detector, WSD component
Project specific analysis: Swear word tagger (AQUA), Negation detection (SIR)
Data export: Indexer (Lucene, Terrier), ARFF export
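Components like these communicate through stand-off annotations: typed spans with begin and end offsets into the unchanged document text. The sketch below shows the idea for a tokenizer and a stopword tagger. The `Annotation` class, the regular expression, and the stopword list are simplifications assumed for illustration and are not the UIMA API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed span over the document text (stand-off annotation)."""
    type: str
    begin: int
    end: int


# Hypothetical stopword list
STOPWORDS = {"the", "a", "of", "and"}


def tokenize(text):
    """Create one Token annotation per run of word characters."""
    return [Annotation("Token", m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def tag_stopwords(text, tokens):
    """Create a Stopword annotation for every token found in STOPWORDS."""
    return [
        Annotation("Stopword", t.begin, t.end)
        for t in tokens
        if text[t.begin:t.end].lower() in STOPWORDS
    ]
```

Because the stopword tagger only reads Token annotations and adds its own, it can be reused behind any tokenizer that produces the same annotation type.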
Advantages of UIMA
• Components can be shared between projects
• Shared model of thinking
  - “Reader + Annotators + Consumer”
  - Configuration of components
• Descriptive component orchestration
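The “Reader + Annotators + Consumer” model can be sketched in a few lines. This is a plain Python illustration of the pattern, not UIMA code; all function and class names are hypothetical, and a dictionary stands in for the shared analysis structure.

```python
def plain_text_reader(texts):
    """Reader: turn raw texts into documents with an empty annotation store."""
    for doc_id, text in enumerate(texts):
        yield {"id": doc_id, "text": text, "annotations": {}}


def tokenizer(doc):
    """Annotator: add whitespace-separated tokens."""
    doc["annotations"]["tokens"] = doc["text"].split()


def make_stopword_tagger(stopwords):
    """Build a configured annotator; the stopword set is its configuration."""
    def annotate(doc):
        tokens = doc["annotations"]["tokens"]
        doc["annotations"]["stopwords"] = [t for t in tokens if t.lower() in stopwords]
    return annotate


class CountingConsumer:
    """Consumer: record the number of tokens per document."""

    def __init__(self):
        self.counts = {}

    def process(self, doc):
        self.counts[doc["id"]] = len(doc["annotations"]["tokens"])


def run_pipeline(reader, annotators, consumer):
    """Pass every document from the reader through all annotators, then the consumer."""
    for doc in reader:
        for annotate in annotators:
            annotate(doc)
        consumer.process(doc)
    return consumer
```

The pipeline is described as data (a reader, a list of annotators, a consumer), so a project can swap in its own reader or add a project-specific annotator without touching the shared components.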
Challenges
• Agree on a type system
  - No automatic type mapping
• Some rough edges in UIMA
  - No real plug’n’work with PEAR packages
  - Using constraints to align annotations seems to be slow
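Aligning annotations usually means finding, for example, all tokens covered by a sentence. The sketch below shows one possible workaround, assuming tokens are sorted by start offset and do not overlap: a binary search over the start offsets instead of testing every token. It is a generic technique and not the UIMA constraint mechanism; the function names are hypothetical.

```python
from bisect import bisect_left


def start_offsets(tokens):
    """Precompute the sorted list of token start offsets once per document."""
    return [begin for begin, _ in tokens]


def covered(tokens, begins, span_begin, span_end):
    """Return tokens lying completely inside [span_begin, span_end).

    tokens: list of (begin, end) pairs sorted by begin, non-overlapping.
    begins: result of start_offsets(tokens).
    """
    lo = bisect_left(begins, span_begin)  # first token starting at or after span_begin
    hi = bisect_left(begins, span_end)    # first token starting at or after span_end
    return [t for t in tokens[lo:hi] if t[1] <= span_end]
```

Each lookup then costs a binary search plus the size of the result, rather than a pass over all tokens in the document.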
Wish list
• Automatic type matching
• Better tool support
  - Improving Eclipse plug-ins (robustness, features)
  - Refactoring of UIMA components
  - CPE runner ++ (automatic logging, performance monitor, etc.)
• Plug’n’work approach
• “Import by name” in CPEs
  - Or make ${CPM_HOME}/path also work for readers/consumers
• Construct XML descriptors from Java annotations
• More intuitive API
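The wish to construct XML descriptors from Java annotations can be illustrated by analogy, with a Python decorator standing in for a Java annotation. The decorator, the component class, and the parameter names are hypothetical. The element names follow the general shape of a UIMA analysis engine descriptor, but the fragment is simplified (no namespace, many elements omitted) and is not validated against the UIMA schema.

```python
import xml.etree.ElementTree as ET


def config_param(name, type_="String", mandatory=False):
    """Hypothetical decorator recording one configuration parameter on a class."""
    def wrap(cls):
        # Copy the list so that subclasses do not share the parent's entries
        cls._params = list(getattr(cls, "_params", [])) + [(name, type_, mandatory)]
        return cls
    return wrap


@config_param("language", mandatory=True)
@config_param("stopwordFile")
class StopwordTagger:
    """Hypothetical component; decorators are applied bottom-up."""


def build_descriptor(cls):
    """Generate a simplified descriptor from the metadata attached to cls."""
    root = ET.Element("analysisEngineDescription")
    ET.SubElement(root, "primitive").text = "true"
    impl = ET.SubElement(root, "annotatorImplementationName")
    impl.text = f"{cls.__module__}.{cls.__qualname__}"
    meta = ET.SubElement(root, "analysisEngineMetaData")
    ET.SubElement(meta, "name").text = cls.__name__
    params = ET.SubElement(meta, "configurationParameters")
    for name, type_, mandatory in getattr(cls, "_params", []):
        param = ET.SubElement(params, "configurationParameter")
        ET.SubElement(param, "name").text = name
        ET.SubElement(param, "type").text = type_
        ET.SubElement(param, "mandatory").text = str(mandatory).lower()
    return ET.tostring(root, encoding="unicode")
```

Keeping the parameter declarations next to the component code means the descriptor cannot drift out of sync with the implementation, which is the main appeal of this wish-list item.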
Thank you very much!
• Acknowledgements:
  - DFG for funding “Semantic Information Retrieval”
  - DFG for funding “Automatic Quality Assessment and Feedback in eLearning 2.0”
http://www.ukp.tu-darmstadt.de/