NTCIR-7 Patent Mining Experiments at RALI
Guihong Cao, Jian-Yun Nie and Lixin Shi
Department of Computer Science and Operations Research
University of Montreal, Canada
Outline
• Introduction
• Our Approaches
• Issues Investigated
• Experiments
• Conclusion
Introduction
• Patent Mining Project
  – Each patent has an IPC code
• Task
  – Query: the abstract of a research paper
  – Document collection: patents with IPC codes
  – Goal: assign IPC codes to each research paper according to relevance
• Possible solution
  – View it as a text categorization problem
Introduction (Cont.)
• Difference in writing style between patents and research papers
  – Patents: more general terms, to cover more related things
  – Research papers: more precise and technical
  – e.g., "music player" vs. "Apple iPod"
• Complexity of the classification problem
  – More than 50,000 IPC codes
  – Very unbalanced distribution
  – Cannot be tackled with traditional text classification approaches
Distribution of IPC codes in US patents

#Patents per IPC code   #IPC codes
1~10                    25944
11~100                  10911
101~500                 1430
501~1000                129
1001~2000               46
2001~3000               5
3001~4000               3
4001~5000               0
>5000                   23
Outline
• Introduction
• Our Approaches
  – Basic approach
  – System description
• The Issues Investigated
• Experiments
• Conclusion
Basic Approach
• Classify the research paper with a K-NN classifier
  – The patents are the labeled instances
  – The distance between a patent and the research paper is measured by relevance
• Find the closest documents with information retrieval
  – Language modeling approach for information retrieval
  – Relevance measured by query likelihood
Language Modeling Approach for Information Retrieval
• Documents are represented with unigram models, i.e., P(w|D)
  – P(w|D) is smoothed to avoid zero probability (Zhai and Lafferty, 2001):
    P(w|D) = \lambda \frac{tf(w,D)}{|D|} + (1 - \lambda) P(w|C)
• A query is represented as a sequence of words
• Relevance is measured by the likelihood of the query with respect to the document model:
    P(q|D) = \prod_{q_i \in q} P(q_i|D)
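As a concrete illustration, here is a minimal Python sketch of this query-likelihood scoring, assuming tokenized term lists and precomputed collection statistics; the smoothing weight lam = 0.5 is an illustrative default, not a value tuned in the experiments.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """log P(q|D) with Jelinek-Mercer smoothing:
    P(w|D) = lam * tf(w,D)/|D| + (1 - lam) * P(w|C)."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_doc = doc_tf[w] / doc_len if doc_len else 0.0
        p_coll = collection_tf.get(w, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p > 0.0:                  # skip terms unseen in the whole collection
            score += math.log(p)
    return score
```

Working in log space avoids numerical underflow on long queries; ranking by log P(q|D) is equivalent to ranking by P(q|D).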
System Description
• The whole system is implemented with the INDRI system (Strohman et al., 2005)
• INDRI
  – Language modeling approach for IR
  – Allows retrieval over different fields
• Classification algorithm:
    score(c, q) = \sum_{i=1}^{K} \delta(ipc(d_i) = c) \, P(q|d_i)
  where \delta(ipc(d_i) = c) is the indicator function
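A minimal sketch of this K-NN aggregation; `retrieve` and `ipc_of` are hypothetical placeholders (in the actual system, retrieval is done by INDRI, whose log scores would need exponentiating or a rank-preserving normalization before being summed as likelihoods).

```python
from collections import defaultdict

def knn_ipc_scores(query, retrieve, ipc_of, k=100):
    """score(c, q) = sum_{i=1..K} delta(ipc(d_i) = c) * P(q|d_i)."""
    scores = defaultdict(float)
    # retrieve(query, k) is assumed to yield the top-k (doc_id, P(q|d)) pairs.
    for doc_id, likelihood in retrieve(query, k):
        scores[ipc_of(doc_id)] += likelihood   # the indicator delta(...) in code form
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Candidate IPC codes are returned best first; K itself is a free parameter, whose impact is examined in the experiments below.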
Outline
• Introduction
• Our Approaches
• The Issues Investigated
• Experiments
• Conclusion
Investigations
• Term Distillation
  – Aims to bridge the different writing styles of research papers and patent descriptions
    • Some words that are common in research papers are not common in patent descriptions, e.g., paper, study, propose
    • These words introduce noise into patent retrieval
• Our approach (sketched below)
  – Select a set of common research-paper terms according to document frequency
  – Filter these common words out at query time
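A minimal sketch of the two steps, assuming precomputed document frequencies over a research-paper corpus; the df ratio threshold of 0.2 is an illustrative assumption, not the paper's actual cut-off.

```python
def build_common_terms(paper_doc_freq, n_papers, df_ratio=0.2):
    """Offline: treat a term as 'common' in research papers when its
    document frequency exceeds a ratio threshold (0.2 is assumed here)."""
    return {w for w, df in paper_doc_freq.items() if df / n_papers >= df_ratio}

def distill_query(query_terms, common_terms):
    """Query time: filter the common words out of the query."""
    return [w for w in query_terms if w not in common_terms]
```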
Common terms
lt, gt, paper, papers, report, reported, method, methods, study, studies, studying, studied, result, results, propose, proposed, based, obtain, obtains, obtained, find, found, show, shows, showed, showing, shown, prepare, prepares, preparing, prepared, carry, carries, carrying, carried
Mining Patent Structures
• Patents are structured documents
• Different fields have different impacts
• Four main fields
  – Title, abstract, specification and claim
• The specification can be divided into four sub-fields
  – Background, description, summary and drawing
• Experiments (see the interpolation sketch below)
  – Using only some of the fields
  – Aggregating occurrences of query terms in different fields with linear interpolation
    • With equal weights
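A minimal sketch of the equal-weight linear interpolation, assuming per-field token lists and a smoothed per-field estimator such as the one sketched earlier; all names here are illustrative.

```python
import math

def field_query_likelihood(query_terms, fields, p_w_given_field, weights=None):
    """log P(q|D) with P(w|D) = sum_f w_f * P(w|D_f).
    `fields` maps a field name (title, abstract, ...) to its token list;
    weights default to the equal values used in the experiments."""
    names = list(fields)
    if weights is None:
        weights = {f: 1.0 / len(names) for f in names}   # equal interpolation weights
    score = 0.0
    for w in query_terms:
        p = sum(weights[f] * p_w_given_field(w, fields[f]) for f in names)
        if p > 0.0:
            score += math.log(p)
    return score
```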
Query Expansion
• An effective technique to enrich the query with terms from top-ranked documents
• Pseudo-relevance feedback
• The number of feedback documents and expansion terms is a key issue
• Usually more effective for short queries
• Is it effective for the Patent Mining task, where the queries are quite long?
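A minimal sketch of pseudo-relevance feedback, assuming tokenized feedback documents; ranking candidate terms by raw frequency is a deliberate simplification (INDRI's built-in expansion is based on a relevance model, not this heuristic).

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, n_docs=20, n_terms=40):
    """Append the most frequent new terms from the top-ranked documents."""
    seen = set(query_terms)
    counts = Counter()
    for doc_terms in feedback_docs[:n_docs]:           # pseudo-relevant documents
        counts.update(w for w in doc_terms if w not in seen)
    return list(query_terms) + [w for w, _ in counts.most_common(n_terms)]
```

The defaults mirror one of the settings tested below (20 feedback documents, 40 expansion terms).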
Outline
• Introduction
• Our Approaches
• The Issues Investigated
• Experiments
• Conclusion
Experiments
• Query and document processing: standard
  – Porter stemmer
  – Stop word removal
• Evaluation metrics (sketched below)
  – Mean average precision (MAP)
  – Precision at top N documents (P@N)
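For reference, a minimal sketch of the two metrics over ranked result lists; `relevant` is assumed to be the set of gold-standard items for a query.

```python
def precision_at_n(ranked, relevant, n):
    """P@N: fraction of the top N ranked items that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def average_precision(ranked, relevant):
    """AP: average of the precision values at each relevant item's rank."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean AP over (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```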
Term Distillation Results

Model              P@30    P@100   MAP
Original           0.0277  0.0047  0.1502
Term Distillation  0.0282  0.0046  0.1491

Term distillation does not seem to be effective. Is this due to the terms selected?
The Effectiveness of Query Expansion (top 20 feedback documents)

#Exp. Terms  P@30    P@100   MAP
0            0.0271  0.0047  0.1488
20           0.0274  0.0029  0.1470
40           0.0274  0.0030  0.1451
60           0.0277  0.0029  0.1447
80           0.0277  0.0030  0.1439
100          0.0276  0.0030  0.1456

Observation: not very effective, possibly because the queries (paper abstracts) are already quite long.
The Impact of Different Fields

T: title   A: abstract   S: specification   C: claim
B: background   D: description   M: summary   R: drawing

Fields       P@30    P@100   MAP
T+A+S+C      0.0277  0.0047  0.1502
T+A+B        0.0270  0.0041  0.1470
T+A+B+D      0.0281  0.0049  0.1489
T+A+B+D+M    0.0276  0.0047  0.1495

No significant differences.
The Impact of Different K Values
Formal Run Results

rali_baseline:  Title + Abstract + Specification + Claim
rali_short_doc: Title + Abstract + Description

Run ID          P@30    P@100   MAP
rali_baseline   0.0234  0.0050  0.1423
rali_short_doc  0.0241  0.0048  0.1437

Marginal effect; more experiments with different field combinations are needed.
Conclusion
• Classification of research abstracts into IPC codes
  – K-NN classifier
• Investigated several issues
  – Only the value of K has some impact on classification effectiveness
  – The other factors do not seem to affect classification accuracy:
    • Different fields
    • Pseudo-relevance feedback
    • Term distillation
• Questions
  – Exploiting more characteristics of patents?
  – Term relationships?
Thanks!