Lucene And Solr Document Classification

Lucene And Solr Document Classification Alessandro Benedetti, Software Engineer, Sease Ltd.

Who I am
Alessandro Benedetti
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Passionate about Semantic, NLP, and Machine Learning Technologies
● Beach Volleyball Player & Snowboarder

Agenda
● Classification
● Lucene Approach
● Solr Integration
● Demo
● Extensions
● Future Work

Classification
“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.” (Wikipedia)

Real World Use Cases
● E-mail spam filter
● Document categorization
● Sexually explicit content detection
● Medical diagnosis
● E-commerce
● Language identification

Basics Of Text Classification
● Supervised learning
● Labelled training samples
● Documents modelled as feature vectors
● Term occurrences as features
● The model predicts the label of unseen documents
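As a minimal sketch of "documents modelled as feature vectors" with term occurrences as features, the snippet below builds count vectors over a shared vocabulary. The whitespace tokenizer, the sample texts, and the labels are assumptions made for illustration only; a real setup would rely on Lucene's text analysis.

```python
from collections import Counter

def term_counts(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())

# Hypothetical labelled training samples: (text, label).
training = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
]

# A shared, sorted vocabulary gives every vector the same dimensions.
vocabulary = sorted({t for text, _ in training for t in term_counts(text)})

def to_vector(text):
    """Map a text to its term-occurrence vector; unknown terms are ignored."""
    counts = term_counts(text)
    return [counts[t] for t in vocabulary]

# Labelled feature vectors, ready to be handed to a classifier.
vectors = [(to_vector(text), label) for text, label in training]
```

Each position in a vector corresponds to one vocabulary term, so two documents can be compared dimension by dimension.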

Apache Lucene
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.

Apache Lucene For Classification
● The Lucene index has complex data structures
● Many organizations already have indexes in place
● Pre-existing data can be used to classify
● No need to train a model from a separate training set
● From training set to inverted index

Apache Lucene For Classification
● Advanced configurable text analysis
● Term frequencies
● Term positions
● Document frequencies
● Norms
● Part of speech tags and custom payloads

K Nearest Neighbours
● Given an index with labelled documents
● Each document has a class field
● Given an unknown document as input
● Given a set of relevant fields
● Search for the top K most similar documents
● Fetch the classes from the retrieved documents
● Return the most frequently occurring class(es)
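The voting step at the end of the list above can be sketched as follows. This is an illustration, not Lucene's implementation: it assumes the similarity search has already run and returned (similarity, class) pairs, and the function name `knn_classify` is hypothetical.

```python
from collections import Counter

def knn_classify(neighbours, k):
    """Return the most frequent class(es) among the k most similar documents.

    neighbours: (similarity, class) pairs in any order.
    """
    top_k = sorted(neighbours, key=lambda n: n[0], reverse=True)[:k]
    if not top_k:
        return []
    votes = Counter(cls for _, cls in top_k)
    best = max(votes.values())
    # Ties are all returned, matching "most occurring class(es)".
    return sorted(cls for cls, count in votes.items() if count == best)
```

Returning a list keeps ties visible instead of silently picking one class.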

More Like This
● KNN uses Lucene More Like This
● Lucene query component
● Extract interesting terms* from the input document fields
● Build a Lucene query
● Run the query against the search index
● Resulting documents are “the similar documents”
* An interesting term is a term:
- occurring frequently in the seed document (high term frequency)
- but quite rare in the corpus (high inverse document frequency)
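The notion of an "interesting term" can be illustrated with a simplified TF * IDF ranking. This sketch is not Lucene's More Like This code: the whitespace tokenizer, the `log(num_docs / df)` form of IDF, and the function and parameter names are assumptions chosen for brevity.

```python
import math
from collections import Counter

def interesting_terms(seed_text, doc_freq, num_docs, max_terms=3, min_tf=1, min_df=1):
    """Rank seed-document terms by term frequency times inverse document frequency.

    doc_freq maps a term to the number of corpus documents containing it.
    """
    tf = Counter(seed_text.lower().split())
    scored = []
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        # Skip terms below the thresholds; df must be at least 1 to divide.
        if freq < min_tf or df < max(min_df, 1):
            continue
        idf = math.log(num_docs / df)
        scored.append((freq * idf, term))
    # Highest score first; alphabetical order breaks ties deterministically.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [term for _, term in scored[:max_terms]]
```

A term that appears in every document gets an IDF of zero, so common words sink to the bottom even when they are frequent in the seed document.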

Naive Bayes Classifier
Assumptions
● Term occurrences are probabilistically independent features
● Term positions are irrelevant (bag of words)
Calculate the probability score of each available class c
● Prior: #DocsInClassC / #Docs
● Likelihood: P(d|c) = P(t1, t2, ..., tn|c) = P(t1|c) * P(t2|c) * … * P(tn|c)
where, given a term t,
P(t|c) = (TF(t) in documents of class c + 1) / (#terms in all documents of class c + #docs of class c)
Assign the top scoring class
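A small sketch of the scoring above, using the prior and the smoothed P(t|c) exactly as stated on the slide. Summing logarithms instead of multiplying probabilities is a choice made here to avoid numeric underflow; the function names and the in-memory training format are assumptions, not Lucene's API.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: (list_of_terms, class) pairs. Returns per-class statistics."""
    docs_per_class = Counter()
    term_freq = defaultdict(Counter)
    for terms, cls in samples:
        docs_per_class[cls] += 1
        term_freq[cls].update(terms)
    return docs_per_class, term_freq

def classify(terms, docs_per_class, term_freq):
    """Return (top scoring class, log score per class)."""
    total_docs = sum(docs_per_class.values())
    scores = {}
    for cls, n_docs in docs_per_class.items():
        # Prior: #DocsInClassC / #Docs
        score = math.log(n_docs / total_docs)
        # Denominator: #terms in all documents of class c + #docs of class c
        denom = sum(term_freq[cls].values()) + n_docs
        for t in terms:
            # Numerator: TF(t) in documents of class c + 1
            score += math.log((term_freq[cls][t] + 1) / denom)
        scores[cls] = score
    return max(scores, key=scores.get), scores
```

The "+ 1" in the numerator keeps a term that never occurred in a class from driving the whole product to zero.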

Document Classification
● Documents are the Lucene unit of information
● A document is a map of field -> value
● Each field may be analysed differently (different tokenization and token filtering)
● Each field may have a different weight for the classification (affecting the similarity score differently)

Apache Solr
Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL.

Solr Integration
Index Time Integration - SOLR-7739
● Ingest the document
● Assign the class
● Set the class as a field value
● Index the document
Request Handler Integration (TO DO) - SOLR-7738
Return an assigned class:
● Given a text and a field
● Given an input document
● Given an indexed document id

Update Request Processor Chain
● Pipeline of processors
● Each single document flows through the chain
● Each processor is executed once
● Last processor triggers the update command
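The pipeline idea can be modelled with a few lines of plain Python. This is a conceptual sketch only, not Solr's Java API: the classes `Processor`, `LowercaseTitle`, and `RunUpdate` are hypothetical, and a list stands in for the index.

```python
class Processor:
    """One step in the chain; subclasses override process()."""

    def __init__(self, next_processor=None):
        self.next = next_processor

    def process(self, doc):
        # Default behaviour: hand the document to the next step, if any.
        if self.next is not None:
            self.next.process(doc)

class LowercaseTitle(Processor):
    """Example step that modifies the document and passes it on."""

    def process(self, doc):
        doc["title"] = doc["title"].lower()
        super().process(doc)

class RunUpdate(Processor):
    """Last step: triggers the update, modelled here as a list append."""

    def __init__(self, index):
        super().__init__(None)
        self.index = index

    def process(self, doc):
        self.index.append(doc)

index = []
chain = LowercaseTitle(RunUpdate(index))
chain.process({"id": "1", "title": "Warp Drive"})
```

Each document visits every processor exactly once, and only the last one writes to the index.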

Update Request Processor
● Update Component
● Configurable Singleton Factory
● Single instance per request thread
● Process a single Document
● SolrCloud compatible*
* Pre processor / Post processor

Classification Update Request Processor
● Access the Index Reader
● A Lucene Document Classifier is instantiated
● A class is assigned by the classifier
● A new field is added to the original Document, with the class
● The document goes through the next processing steps
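The steps above can be sketched as a single function. This is an illustration rather than the Solr implementation: the classifier is any callable, and skipping documents that already carry a class is an assumption of this sketch, made so that human-labelled training documents pass through unchanged.

```python
def classification_processor(doc, classifier, class_field, next_step):
    """Assign a class to doc if it has none, then pass it down the chain.

    classifier: callable taking the document and returning a class or None.
    next_step: callable representing the rest of the processing chain.
    """
    # Assumption: documents that already have a class are left untouched.
    if class_field not in doc:
        assigned = classifier(doc)
        if assigned is not None:
            doc[class_field] = assigned
    next_step(doc)
```

For example, `classification_processor(doc, my_classifier, "cat", index.append)` would add a `cat` field to an unlabelled document before it reaches the (hypothetical) list-backed index.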

Solrconfig.xml - Update Handler
...
<requestHandler name="/update" >
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</requestHandler>
...

Solrconfig.xml - Chain configuration
...
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    ...
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
...

Solrconfig.xml - K nearest neighbour classifier config
<processor class="solr.ClassificationUpdateProcessorFactory">
  <str name="inputFields">title^1.5,content,author</str>
  <str name="classField">cat</str>
  <str name="algorithm">knn</str>
  <str name="knn.k">20</str>
  <str name="knn.minTf">1</str>
  <str name="knn.minDf">5</str>
</processor>
N.B. classField must be stored

Solrconfig.xml - Naive Bayes classifier config
<processor class="solr.ClassificationUpdateProcessorFactory">
  <str name="inputFields">title^1.5,content,author</str>
  <str name="classField">cat</str>
  <str name="algorithm">bayes</str>
</processor>
N.B. classField must be indexed (take care of analysis)

Solr Classification - Important Notes
● Lucene >= 6.0
● Solr >= 6.1
● Classification needs a training set -> an index with initially human-assigned classes is required

Solr Classification - Demo
● Sci-Fi StackExchange dataset
● Roughly 18,000 questions and answers
● 70% training set + 30% test set

Solr Classification - Demo
● Index the training set documents (this is our ground truth)
● Index the test set (classification will happen automatically at indexing time)
● Evaluate the test set (a simple Java app verifies that the automatically assigned classes are consistent with the expected ones)
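The evaluation loop of the demo can be approximated in a few lines. The demo itself used a simple Java app; the sketch below is a stand-in with hypothetical function names, a fixed random seed for repeatability, and plain accuracy as the only metric.

```python
import random

def split(samples, train_ratio=0.7, seed=42):
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(expected, assigned):
    """Fraction of test documents whose assigned class matches the expected one."""
    correct = sum(1 for e, a in zip(expected, assigned) if e == a)
    return correct / len(expected)
```

In this setup the training portion would be indexed with its human-assigned classes, the test portion indexed without them, and `accuracy` applied to the expected versus automatically assigned classes.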

Solr Classification - Extensions SOLR-8871
Multi classes support
● Class field may be multi valued
● Assign multiple classes
● Not only the top scoring but top N (parameter)
Split human/auto assigned classes
● classTrainingField
● classOutputField
Default: use the same field
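Assigning the top N classes instead of only the best one amounts to ranking the per-class scores. A minimal sketch, assuming the classifier exposes a score per class; the function and parameter names are hypothetical.

```python
def top_classes(class_scores, max_classes):
    """Return up to max_classes class names, highest score first."""
    # Sort by descending score; the class name breaks ties deterministically.
    ranked = sorted(class_scores.items(), key=lambda cs: (-cs[1], cs[0]))
    return [cls for cls, _ in ranked[:max_classes]]
```

The resulting list maps naturally onto a multi-valued class field.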

Solr Classification - Extensions SOLR-8871
Classification Context Filtering
● Reduce the document space to consider -> reduce the training set
● Useful when only a subset of the index may be interesting for classification
● Consider only the human-labelled documents as training data
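Context filtering can be pictured as a filter applied to the index before the classifier looks at it. In the sketch below the index is a plain list of dictionaries and `labelled_by` is a hypothetical field used to mark human-labelled documents; in Solr the restriction would be expressed as a query rather than a Python predicate.

```python
def training_context(index, keep):
    """Return only the documents the classifier may use as training data."""
    return [doc for doc in index if keep(doc)]

# Hypothetical index content.
index = [
    {"id": "1", "cat": "scifi", "labelled_by": "human"},
    {"id": "2", "cat": "fantasy", "labelled_by": "classifier"},
    {"id": "3", "cat": "scifi", "labelled_by": "human"},
]

# Keep only the human-labelled documents.
human_only = training_context(index, lambda d: d["labelled_by"] == "human")
```

Restricting the training set this way keeps automatically assigned classes from reinforcing their own mistakes.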

Solr Classification - Extensions SOLR-8871
Individual Field Weighting
● When classifying, each field has a different importance, e.g. title vs content
● Set a different boost per field
● Knn compatible
● Bayes compatible
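Per-field boosts use the `field^boost` notation seen in the `inputFields` setting (`title^1.5,content,author`). The sketch below parses that notation and combines per-field scores; the default boost of 1.0 for fields without an explicit value and the weighted-sum combination are assumptions of this illustration, not a description of Solr's internals.

```python
def parse_input_fields(spec):
    """Parse 'title^1.5,content,author' into a field -> boost mapping."""
    weights = {}
    for entry in spec.split(","):
        name, _, boost = entry.strip().partition("^")
        # Assumption: a field without an explicit boost weighs 1.0.
        weights[name] = float(boost) if boost else 1.0
    return weights

def weighted_score(per_field_scores, weights):
    """Combine per-field similarity scores using the configured boosts."""
    return sum(weights[f] * s for f, s in per_field_scores.items() if f in weights)
```

With these weights a match on `title` counts one and a half times as much as the same match on `content`.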

Solr Classification - Future Work
● Numeric field support (Knn) (Euclidean distance based)
● Lat/lon support (Knn) (geo distance based)
● SolrCloud support (use the entire sharded index as training set)

Questions?

● Special thanks to Tommaso Teofili, the Apache committer who followed the development and made the contributions possible.
● And to the audience :)
