[ RMLL 2013, Bruxelles – Thursday 11th July 2013 ]
Presentation of OpenNLP
Presenter: Dr Ir Robert Viseur
What is OpenNLP?
• Toolkit for processing natural language text.
• Project of the Apache Software Foundation.
• Developed in Java.
• Released under the Apache License, Version 2.0.
• Download and documentation: http://opennlp.apache.org/
What are the features?
• Support for common NLP tasks:
  – tokenization,
  – sentence segmentation,
  – part-of-speech tagging,
  – named entity extraction,
  – chunking.
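What is part-of-speech tagging?
• Part-of-speech tagging assigns a grammatical category (noun, verb, adjective, ...) to each token of a text.
• Example: [figure]
• See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html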
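As an illustration of what tagged text looks like, the sketch below parses a line in the `token_TAG` format that the OpenNLP manual shows for the command-line POS tagger (Penn Treebank tags). The helper name `parse_tagged` is hypothetical, and the sample line is only an excerpt chosen for illustration.

```python
def parse_tagged(line):
    """Split whitespace-separated 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST underscore so tokens containing '_' stay intact.
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


# Illustrative sample in the token_TAG output format.
sample = "Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ"
tagged = parse_tagged(sample)
print(tagged)
```

Here NNP marks a proper noun, CD a number, NNS a plural noun and JJ an adjective.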
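What is named entity extraction?
• Named entity extraction detects and labels mentions of entities such as persons, organizations, locations or dates in a text.
• Example: [figure]
• See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html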
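The OpenNLP name finder marks entities inline with `<START:type> ... <END>` markers. The sketch below pulls the marked spans out of such a line with a regular expression. The function name `extract_entities` and the sample sentence are assumptions made for illustration; this is one possible way to post-process the output, not part of OpenNLP itself.

```python
import re

# Matches "<START:type> some tokens <END>" and captures the type and the tokens.
NE_PATTERN = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def extract_entities(text, entity_type=None):
    """Return (type, name) pairs, optionally restricted to one entity type."""
    found = [(m.group(1), m.group(2)) for m in NE_PATTERN.finditer(text)]
    if entity_type is not None:
        found = [(t, name) for t, name in found if t == entity_type]
    return found


# Illustrative sample (hypothetical annotated sentence).
sample = (
    "<START:person> Pierre Vinken <END> , 61 years old , will join "
    "<START:organization> Elsevier N.V. <END> as a director ."
)
persons = extract_entities(sample, "person")
print(persons)
```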
How does it work? (1/2)
• Each feature relies on pre-trained models.
• Each pre-trained model is built for one language and one type of use.
• Supported languages: da, de, en, es, nl, pt, se.
• Warnings:
  – The functional coverage varies across languages.
  – French is not supported!
• See http://opennlp.sourceforge.net/models-1.5/
• Usable from the command line or as a Java library.
• Warning: the CLI has to load the models at start-up, which takes time.
How does it work? (2/2)
• Example (English vs Spanish): [figure]
What are the selection criteria?
• Support of the product.
• License.
• Available languages.
• Precision / recall.
• Speed of text processing.
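Precision and recall can be measured by comparing a tool's output with a hand-annotated reference. The sketch below uses the standard definitions (precision = correct / extracted, recall = correct / expected). The function name and the two small sets are made up for illustration and are not results from the talk.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of a set of predictions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical person names: expected (gold) vs. extracted by a tool.
gold = {"Jane Doe", "John Smith", "Ann Lee"}
predicted = {"Jane Doe", "Ann Lee", "Bruxelles", "Java"}
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```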
Are there free (as in freedom) alternative tools?
• Other lightweight tools:
  – Stanford Log-linear Part-Of-Speech Tagger (POST),
  – Stanford Named Entity Recognizer (NER),
  – TagEN,
  – Java Automatic Term Extraction toolkit.
• Frameworks:
  – In Java: UIMA, GATE.
  – In other languages: NLTK (Python).
Example: tag cloud creation (1/6)
• Starting point: a website.
• Example: www.adacore.com
• What we want (from the website content):
  – a common tag cloud,
  – a circular tag cloud.
• Main steps: crawl, cleaning of the HTML documents, extraction of named entities (persons) and of terminology (then merge), and display (tag cloud).
Example: tag cloud creation (2/6)
• Cleaning:
  – Remove the HTML tags and keep only the useful content.
• Warnings:
  – NLP tools are sensitive to noise in the raw data.
  – Pay attention to the language of the document.
• Use of an HTML boilerplate removal tool (HTML -> TXT).
  – Tool: Boilerpipe.
  – See http://code.google.com/p/boilerpipe/
• Next: normalization of the text.
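To give an idea of the HTML -> TXT step, here is a deliberately naive sketch built on Python's standard `html.parser`. It is not Boilerpipe and does no real boilerplate detection: it only drops tags, skips `script` and `style` content, and collapses whitespace. The class and function names, and the sample page, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Basic normalization: join the chunks and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


# Made-up page used only to exercise the function.
page = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><h1>AdaCore</h1><p>Tools for  <b>Ada</b> developers.</p></body></html>"
)
print(html_to_text(page))
```

A dedicated tool remains preferable on real pages, since menus, footers and advertising would all survive this simple approach.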
Example: tag cloud creation (3/6)
• Named entity extraction.
  – Standard behaviour in OpenNLP: OpenNLP adds tags to the text.
  – Here: extraction of Person named entities.
• Terminology extraction.
  – First: part-of-speech tagging (POST).
  – Next: identification and filtering (threshold) of:
    – collocations (e.g. Noun_Noun, Adjective_Noun, ...),
    – proper names (often brands or people).
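The terminology extraction step can be sketched as follows, interpreting the collocation patterns as noun + noun and adjective + noun sequences over Penn Treebank tags. The tag sets, the function names, the threshold value and the sample tokens are all assumptions for illustration; the talk does not specify the actual rules or threshold used.

```python
from collections import Counter

# Assumed Penn Treebank tag groups.
NOUN_TAGS = {"NN", "NNS"}
ADJ_TAGS = {"JJ"}
PROPER_TAGS = {"NNP", "NNPS"}


def candidate_terms(tagged):
    """Yield noun+noun and adjective+noun bigrams, plus single proper nouns."""
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t2 in NOUN_TAGS and (t1 in NOUN_TAGS or t1 in ADJ_TAGS):
            yield f"{w1.lower()} {w2.lower()}"
    for word, tag in tagged:
        if tag in PROPER_TAGS:
            yield word


def extract_terms(tagged, threshold=2):
    """Keep only the candidates seen at least `threshold` times."""
    counts = Counter(candidate_terms(tagged))
    return {term: n for term, n in counts.items() if n >= threshold}


# Hypothetical POS-tagged tokens.
tagged = [
    ("AdaCore", "NNP"), ("provides", "VBZ"), ("static", "JJ"),
    ("analysis", "NN"), ("tools", "NNS"), (".", "."),
    ("AdaCore", "NNP"), ("ships", "VBZ"), ("static", "JJ"),
    ("analysis", "NN"), ("updates", "NNS"), (".", "."),
]
terms = extract_terms(tagged)
print(terms)
```

The threshold removes one-off combinations such as "analysis tools", which appear only once in the sample.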
Example: tag cloud creation (4/6)
• Process:
  – Website (Internet) -> Crawl -> Website (local)
  – Raw HTML document -> Conversion to text -> Normalization
  – POS tagging -> Terminology extraction
  – NE extraction
  – Merge (of terminology and named entities) -> Tags
  – Tag cloud (for a website)
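The final merge and display steps might look like the sketch below. It assumes that both extraction steps produce tag -> count mappings, that a tag found by both keeps the larger count (to avoid counting the same mentions twice), and that font size scales linearly with the count. These choices, the function names and the sample counts are hypothetical.

```python
def merge_tags(term_counts, entity_counts, top_n=10):
    """Merge two tag->count mappings, keeping the larger count on overlap."""
    merged = dict(term_counts)
    for tag, n in entity_counts.items():
        merged[tag] = max(merged.get(tag, 0), n)
    # Sort by decreasing count, then alphabetically for a stable order.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


def font_sizes(tags, min_size=10, max_size=32):
    """Map each tag to a font size proportional to its count."""
    if not tags:
        return {}
    counts = [n for _, n in tags]
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        tag: min_size + (n - lo) * (max_size - min_size) // span
        for tag, n in tags
    }


# Made-up counts standing in for the outputs of the two extraction steps.
terms = {"AdaCore": 8, "static analysis": 5, "Jane Doe": 2}
entities = {"Jane Doe": 3, "John Smith": 1}
tags = merge_tags(terms, entities)
sizes = font_sizes(tags)
print(tags)
print(sizes)
```

The resulting sizes can then be handed to whichever rendering layer draws the common or circular tag cloud.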
Example: tag cloud creation (5/6)
• Result: common tag cloud. [figure]
Example: tag cloud creation (6/6)
• Result: circular tag cloud. [figure]
Thanks for your attention. Any questions?
Contact
Dr Ir Robert Viseur
Email (@CETIC): robert.viseur@cetic.be
Email (@UMONS): robert.viseur@umons.ac.be
Phone: 0032 (0) 479 66 08 76
Website: www.robertviseur.be
This presentation is covered by the « CC-BY-ND » license.