KNN and re-ranking models for English patent mining at NTCIR-7




  1. KNN and re-ranking models for English patent mining at NTCIR-7. Tong Xiao, Feifei Cao, Tianning Li, Guolong Song, Ke Zhou, Jingbo Zhu and Huizhen Wang. Natural Language Processing Lab, Northeastern University (P. R. China). xiaotong@mail.neu.edu.cn

  2. Outline • Overview • Basic idea • Methodology – KNN-based method – Re-ranking • Experiment • Discussion • Summary

  3. Outline • Overview • Basic idea • Methodology – KNN-based method – Re-ranking • Experiment • Discussion • Summary

  4. Introduction of our group • Natural Language Processing Laboratory, College of Information Science and Engineering, Northeastern University • Working on a variety of problems related to Natural Language Processing – statistical machine translation – syntactic parsing – applied semantics, ontology learning – text mining • Focus on patent mining since 2007 • Welcome to our homepage: http://www.nlplab.com

  5. Patent mining task at NTCIR-7 • Patent mining task – mapping research papers into the patent taxonomy (International Patent Classification, IPC) • Three sub-tasks – English patent mining – Japanese patent mining – Cross-language patent mining • We participated in the English patent mining sub-task • The system takes the title and abstract of a paper as input and outputs a ranked list of IPC codes. [Example patent record from the training data: <TITLE>End-ventilating adjustable pitch arcuate roof ventilator</TITLE> <ABSTRACT>A roof ridge ventilator is provided, comprising preferably a molded ventilator, with openings along the sides thereof for passage of air therethrough and with openings at ends thereof for passage of air therethrough via gaps provided in pluralities of rows of tabs …</ABSTRACT> <IPC>F24F_7_02, F24F_7_007</IPC> <CLAIM>What is claimed is: 1. A roofing ridge ventilator for venting a roof for …</CLAIM>] [Example system output for the paper <TITLE>Study on a Natural Ventilation System Using a Pitched Roof with Breathing Walls Part 1 Proposal of the System and Its Design for Ventilation</TITLE> <ABSTRACT>We proposed a natural ventilation system using a pitched roof with Breathing Walls, …</ABSTRACT>: E04B_1_70 (rank 1, score 14.23), F24F_7_10 (2, 13.06), F24F_7_007 (3, 12.76), F24F_1_00 (4, 11.70), F24F_7_08 (5, 11.51), F24F_7_013 (6, 11.38), F24F_7_06 (7, 9.923), F24F_1_02 (8, 7.686), …]

  6. Outline • Overview • Basic idea • Methodology – KNN-based method – Re-ranking • Experiment • Discussion • Summary

  7. Challenges • Huge amount of training patents – over 3 million training samples (USPTO and PAJ data) – how to train a supervised classifier or ranker on that scale • Huge label set and multi-label problem – IPC is a hierarchical classification system which consists of more than 60,000 IPC codes (e.g. F24F_7 subdivides into F24F_7_10, F24F_7_08, F24F_7_06), and each patent carries several IPC labels

  8. Challenges • Class imbalance problem – the distribution of IPC codes over patents is heavily skewed • Different writing styles between research papers and patents – this conflicts with the foundational hypothesis of supervised document classification: two patents on the same topic can be nearly identical (similarity ≈ 1.0), while a paper and a patent on that same topic may look quite different

  9. Motivation • Difficult to apply sophisticated machine learning methods such as maximum entropy models and support vector machines to the patent mining task – a great deal of memory space and time is required for training – no good solutions to multi-label classification on a very large class set • The K-Nearest Neighbor (KNN) method is a comparatively easy solution – it only extracts similar examples, so no training process is required – KNN is itself a ranking method

  10. Outline • Overview • Basic idea • Methodology – KNN-based method – Re-ranking • Experiment • Discussion • Summary

  11. KNN-based method • Key components – KNN-based ranking – Re-ranking • Each document is represented as a vector in our system • Pipeline: pre-processing (extracting the title and abstract from the research paper; tokenization, removing case information, stemming) → KNN-based ranking (similarity calculation against the English patents used for training, then ranking) → re-ranking (rank combination and SVM)
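To make the "document as a vector, then similarity" step concrete, here is a minimal tf-idf + cosine sketch in Python. This is an illustration only, not the team's actual implementation: the helper names (`tfidf_vectors`, `cosine`) and the toy documents are made up, and the slides do not specify the exact weighting scheme used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse tf-idf vector (dict of term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In the actual system, the "documents" would be the stemmed title+abstract of the paper and of each training patent; the cosine score is then one of several similarity features.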

  12. Similarity calculation • Calculate the similarity between the test sample (research paper) and the training samples (patents) • State-of-the-art methods – cosine + tf-idf – BM25 (Robertson et al., 1998) – SMART (Buckley et al., 1996) – PIV (Singhal et al., 1996) – or some others • Log-linear method – combine the different similarities (features), with different weights for different features, to generate a refined similarity: Score(c) = exp(Σ_{m=1..M} λ_m · Score_m(c)) / Σ_{c'} exp(Σ_{m=1..M} λ_m · Score_m(c'))
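The log-linear combination amounts to a softmax over the weighted sum of the M individual similarity scores. A minimal sketch, assuming that reading of the (partly garbled) slide formula; the function name and toy scores are hypothetical:

```python
import math

def log_linear_score(features, weights):
    """Combine M similarity scores per candidate into one normalized score.

    features: {candidate: [score_1, ..., score_M]}
    weights:  [lambda_1, ..., lambda_M]
    Returns {candidate: combined score}, normalized so scores sum to 1.
    """
    raw = {c: math.exp(sum(l * s for l, s in zip(weights, scores)))
           for c, scores in features.items()}
    z = sum(raw.values())                      # normalization over candidates
    return {c: v / z for c, v in raw.items()}
```

The weights λ_m let the system trust, say, BM25 more than plain cosine; how the NTCIR-7 system tuned them is not stated on this slide.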

  13. Ranking • 1. Original KNN ranking method – score each IPC code by the number of its occurrences in the extracted top-k documents • 2. Naïve method – the order of IPC codes follows the order of their first occurrences in the extracted top-k documents • 3. Sum/SumAver – the score is calculated by summing up the similarities of all the extracted documents containing the given IPC code; for SumAver, we average the similarity for each sample • 4. Listweak/ListweakAver – to emphasize the patents ranked in the front part of the list, a new factor is introduced • 5. Weak/WeakAver – a drawback of KNN is that the prediction for the input document tends to be dominated by the classes with more frequent examples, due to the class imbalance problem; so punish the classes which contain more training samples
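Method 3 (Sum/SumAver) can be sketched in a few lines. This is an illustrative sketch only: `sum_rank` is a made-up name, and the retrieved list is assumed to be (similarity, IPC-code list) pairs.

```python
from collections import defaultdict

def sum_rank(retrieved, average=False):
    """Sum: score(c) = total similarity of retrieved patents carrying IPC code c.
    SumAver: divide that total by the number of such patents."""
    total = defaultdict(float)
    count = defaultdict(int)
    for sim, codes in retrieved:
        for c in codes:
            total[c] += sim
            count[c] += 1
    if average:
        total = {c: total[c] / count[c] for c in total}
    return sorted(total.items(), key=lambda kv: -kv[1])
```

On the top-5 list used in the next slide's example, Sum ranks IPC2 first (0.21 + 0.09 + 0.09 = 0.39), while SumAver ranks IPC1 first (0.28 / 2 = 0.14 vs. 0.39 / 3 = 0.13), showing how averaging counteracts frequent classes.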

  14. Ranking – method 1 • 1. Original KNN ranking method – score each IPC code by the number of its occurrences in the extracted top-k documents • Suppose that we obtain the following list (top-5) after similarity calculation: rank 1: p02 (IPC1, IPC2; sim 0.21), rank 2: p03 (IPC3, IPC4; sim 0.11), rank 3: p04 (IPC2; sim 0.09), rank 4: p05 (IPC2; sim 0.09), rank 5: p01 (IPC1; sim 0.07) • IPC list after ranking: IPC2 (score 3, occurred 3 times), IPC1 (2), IPC3 (1), IPC4 (1)
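The occurrence counting in this worked example is exactly what `collections.Counter` does; a short sketch using the slide's toy data (the variable names are made up):

```python
from collections import Counter

# Top-5 patents from the slide, with their IPC codes (similarities are not
# used by this method, only occurrence counts).
top_k = [("p02", ["IPC1", "IPC2"]),
         ("p03", ["IPC3", "IPC4"]),
         ("p04", ["IPC2"]),
         ("p05", ["IPC2"]),
         ("p01", ["IPC1"])]

counts = Counter(code for _, codes in top_k for code in codes)
ranking = counts.most_common()
# [('IPC2', 3), ('IPC1', 2), ('IPC3', 1), ('IPC4', 1)]
```

Ties (IPC3 and IPC4) keep their first-encountered order, matching the slide's ranked IPC list.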

  15. Ranking – method 2 • 2. Naïve method – the order of IPC codes follows the order of their first occurrences in the extracted top-k documents • Suppose that we obtain the following list (top-5) after similarity calculation: rank 1: p02 (IPC1, IPC2; sim 0.21), rank 2: p03 (IPC3, IPC4; sim 0.11), rank 3: p04 (IPC2; sim 0.09), rank 4: p05 (IPC2; sim 0.09), rank 5: p01 (IPC1; sim 0.07) • IPC list after ranking: IPC1 (first occurrence; score 0.21), IPC2 (second occurrence; 0.21), IPC3 (0.11), IPC4 (0.11)
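The naïve first-occurrence ordering can be reproduced with an insertion-ordered dict (a sketch; `naive_rank` is a made-up name, and the score attached to each code is assumed to be the similarity of its first occurrence, as in the slide's table):

```python
def naive_rank(top_k):
    """Order IPC codes by first occurrence in the similarity-ranked top-k list,
    keeping the similarity of that first occurrence as the score."""
    seen = {}                       # dicts preserve insertion order (Python 3.7+)
    for sim, codes in top_k:
        for c in codes:
            seen.setdefault(c, sim)  # only the first occurrence is recorded
    return list(seen.items())
```

On the slide's top-5 list this yields IPC1 (0.21), IPC2 (0.21), IPC3 (0.11), IPC4 (0.11), matching the "IPC list after ranking".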
