
InvIdenti: Author Disambiguation for Medical Patents



  1. InvIdenti: Author Disambiguation for Medical Patents
     Bachelor Thesis Presentation
     Sanchit Alekh, 28 July 2016
     Guide (IIIT-A): Prof. Dr. U.S. Tiwary
     Guide (RWTH Aachen): PD Dr. Christoph Quix
     Enrolment: IIT2012108
     Email: iit2012108@iiita.ac.in / alekh@dbis.rwth-aachen.de
     Prof. M. Jarke, Lehrstuhl Informatik 5, RWTH Aachen

  2. Contents
     1. Introduction and Goals
     2. Background
     3. Approach and Solution
     4. Evaluation
     5. Conclusion
     6. Scope for Future Work

  3. Introduction and Goals: 1. What and Why?
     • Author Disambiguation: Distinguish between inventors with the same or similar names / competence fields
     • Identifying by name has severe limitations
     • Spelling errors in the patent database introduce ambiguity
     • Authors/Inventors may share name and/or expertise area
     • Manual approaches are infeasible and not future-proof due to the explosion in the number of patents

  4. Introduction and Goals: 2. Software Functionality Goals
     • Feature Selection: Find good and representative features for the disambiguation task
     • Importance Weighting of Features
     • Similarity Calculation
     • Patent Clustering
     • Patent-Publication Matching

  5. Introduction and Goals: 3. Software Quality Goals
     • Software Design and Architecture: Software should conform to S.O.L.I.D principles for code maintainability and possibility of future extension
     • Support for Parallelization & Multiprocessor Architecture
     • Lucid Documentation for long-term maintainability
       • UML Diagrams
       • JavaDoc™ Documentation
       • Wiki Pages

  6. Background: 1. Project Mi-Mappa
     • Complex innovation in medical engineering is not possible without collaboration
     • Goal is to develop an integrative competence model based on Data Mining Algorithms
     • Assignment of patents and medical products to competence fields
     • Actors selected based on published texts for a given project
     • Use of Ontology Modeling and matching, Data and Text Mining

  7. Background: 2. Related Work
     • PatentsView Inventor Disambiguation Workshop, Sept. '15: Neural Networks, Rule-based methods, Ensemble ML Methods for Inventor Disambiguation
     • [Fleming et al. 2014] Disambiguation and Co-Authorship Networks of the US Patent Inventor Database (1975-2010): Uses a Naïve Bayesian Classifier Technique with Blocking
     • [Maraut et al. 2014] Identifying author-inventors from Spain: Computes a global similarity and clusters inventors based on that

  8. Solution: Outline
     1. The underlying data structure is an Inventor-Patent Instance, which stores the metadata as well as textual features
     2. An assortment of 10 features is used: 6 metadata and 4 textual features
     3. A different Feature Similarity metric is used for each feature to compute a weighted similarity matrix between instance pairs
     4. Weight Training is done with Logistic Regression, using pre-labelled instances from the dataset provided by Fleming et al.
     5. Hierarchical Clustering and DBSCAN are used to assign inventor-patent instances to clusters with unique inventors

  9. Solution: Flow
     Fig. 9.1 Flowchart of processes involved in InvIdenti

  10. Solution: Inventor-Patent Instance
      Fig. 10.1 Inventor-Patent Instances obtained from Patent X
      Fig. 10.2 Ten features used to represent the Inventor-Patent Instance

  11. Solution: Similarity
      • Feature Similarity Techniques
        1. Name: Levenshtein Distance
        2. Location: Country + Distance (from Latitude & Longitude)
        3. Assignee: Assignee Code + Levenshtein Distance of Assignee Name
        4. Technology Class: Number of shared classes
        5. Co-Inventors: Number of shared Co-Inventors
        6. Textual Features: Cosine Similarity between Document Vectors
      Fig. 11.1 Feature Similarity Calculation for Location, Co-Author and Textual Features
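
The per-feature measures listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the InvIdenti code: the function names are hypothetical, and the haversine formula is an assumed way of turning latitude and longitude into a distance, since the slides do not say which formula is used.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km (assumed formula, not confirmed by the slides)."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def shared_count(a, b) -> int:
    """Number of shared items, e.g. technology classes or co-inventors."""
    return len(set(a) & set(b))


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two document vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The Levenshtein and haversine results are distances, so they still have to be converted to similarities before they can be combined with the other features.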

  12. Solution: Similarity
      • Feature Similarity Transformations
        1. Distance Measures are converted to Similarity Measures
        2. All Similarity values are normalized to fall within the range [0,1]
      • Global Similarity
        1. $S_{\text{global}} = \sum_{i} w_i S_i$, where $w_i$ are feature weights and $S_i$ are the normalized similarity values
        2. Threshold: ϑ
      • How to find suitable values for weights and threshold? Logistic Regression
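
A minimal sketch of the two transformations and the weighted sum follows. The linear distance-to-similarity mapping and the `max_distance` cap are assumptions made for illustration, since the slides do not state which conversion is used; all function names are hypothetical.

```python
def distance_to_similarity(distance: float, max_distance: float) -> float:
    """Map a distance in [0, max_distance] to a similarity in [0, 1].

    Assumed linear mapping: distance 0 gives 1.0, max_distance or more gives 0.0.
    """
    if max_distance <= 0:
        return 1.0
    clipped = min(max(distance, 0.0), max_distance)
    return 1.0 - clipped / max_distance


def global_similarity(similarities, weights) -> float:
    """Weighted sum of normalized feature similarities (S_global)."""
    if len(similarities) != len(weights):
        raise ValueError("need exactly one weight per feature similarity")
    return sum(w * s for w, s in zip(weights, similarities))


def is_match(similarities, weights, threshold: float) -> bool:
    """Two instances match when S_global exceeds the threshold."""
    return global_similarity(similarities, weights) > threshold
```

For example, with two features weighted 0.6 and 0.4 and similarities 1.0 and 0.5, the global similarity is 0.8, which is a match for a threshold of 0.7 but not for 0.9.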

  13. Solution: Logistic Regression
      • Logistic Regression models the probability $P(Y = 1 \mid X = x)$ of a binary output variable $Y \in \{0,1\}$; the model is fitted by maximizing the log-likelihood
      • The Logistic Function is used to model this probability because it is bounded in both directions; its inverse is the Logit Function
      • Solving the logit equation for $P(Y = 1 \mid X = x)$ gives the Sigmoid Function
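
The equation from this slide is not preserved in the transcript. For reference, the textbook form of the logit equation and the sigmoid obtained from it is given below; the coefficient symbols $\beta_0$ and $\beta_i$ are assumptions and may differ from the notation on the original slide.

```latex
\ln \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} = \beta_0 + \sum_{i} \beta_i x_i
\quad \Longrightarrow \quad
P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i} \beta_i x_i\right)}}
```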

  14. Solution: Logistic Regression
      • Using Logistic Regression, the aim is to train the model on labelled data to obtain the weights and the threshold
      • There is a match if $\sum_{i} w_i x_i$ is greater than ϑ, and no match if it is less than ϑ
      • For training, a cost function must be associated with the sigmoid function. The cost function follows a negative-log form
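
The cost function itself is not preserved in the transcript. The standard negative-log cost for logistic regression is shown below as a reference, under the assumption that the threshold ϑ enters the model as a negative bias, so that $h_w(x) > 0.5$ exactly when $\sum_{i} w_i x_i > \vartheta$. The symbol $h_w$ is introduced here for illustration only.

```latex
h_w(x) = \frac{1}{1 + e^{-\left(\sum_{i} w_i x_i - \vartheta\right)}}, \qquad
\mathrm{Cost}\bigl(h_w(x), y\bigr) =
\begin{cases}
  -\log h_w(x)                & \text{if } y = 1 \\
  -\log\bigl(1 - h_w(x)\bigr) & \text{if } y = 0
\end{cases}
```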

  15. Solution: Logistic Regression
      • For training in Logistic Regression, the classic Gradient Descent method is used, i.e. each error correction is proportional to the gradient of the cost function
      • Therefore, after every iteration of Gradient Descent, each weight is updated by a step along the negative gradient scaled by α, where α is the learning rate
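
The training loop can be sketched as batch gradient descent on the mean negative-log cost. This is a generic illustration, not the InvIdenti implementation: the function names, the default learning rate and the iteration count are all assumptions, and the threshold is read off as the negated bias term.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(weights, bias, x) -> float:
    """Estimated probability that the instance pair described by x is a match."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(X, y, alpha=0.5, iterations=2000):
    """Batch gradient descent; returns (weights, bias).

    Update rule per iteration: w_j <- w_j - alpha * dJ/dw_j,
    where J is the mean negative-log cost over the training pairs.
    """
    m, n = len(X), len(X[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            error = predict_probability(weights, bias, xi) - yi
            for j in range(n):
                grad_w[j] += error * xi[j]
            grad_b += error
        weights = [w - alpha * g / m for w, g in zip(weights, grad_w)]
        bias -= alpha * grad_b / m
    return weights, bias
```

With this parameterization, a predicted probability above 0.5 is equivalent to the weighted sum exceeding `-bias`, so `-bias` plays the role of the threshold ϑ.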

  16. Solution: Transitivity
      • Simple Binary Classification using Logistic Regression does not yield good results. Why?
        1. Many inventors cover several expertise areas
        2. Inventors may change their location or organization/university
        3. Logistic Regression often suffers from overfitting
      • To remedy this, we propose that an additional property, i.e. Transitivity, be fulfilled by patents (if patents A and B match and patents B and C match, then A and C are assigned to the same inventor)

  17. Solution: Transitivity
      • In InvIdenti, Transitivity is effected by Clustering Algorithms, i.e. Hierarchical Clustering and DBSCAN
      • In Hierarchical Clustering, the type of linkage method used controls the extent of transitivity
        1. Single-Linkage: Promotes chaining; best transitivity
        2. Complete-Linkage: Avoids chaining; worst transitivity
        3. Group-Average Linkage: Medium transitivity
      • In DBSCAN, the parameter MinPts determines the extent of transitivity. MinPts = 1 guarantees chaining and best transitivity
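
The effect of MinPts can be illustrated with a small DBSCAN sketch that works directly on a pairwise similarity matrix. Using a minimum similarity in place of the usual distance radius is an assumption made here to stay in the vocabulary of the slides; the function and parameter names are hypothetical.

```python
def dbscan(similarity, min_similarity: float, min_pts: int):
    """Return one cluster label per instance; -1 marks noise.

    Two instances are neighbours when their similarity is at least
    min_similarity. An instance counts as its own neighbour, so with
    min_pts = 1 every instance is a core point.
    """
    n = len(similarity)
    neighbours = [[j for j in range(n) if similarity[i][j] >= min_similarity]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep chaining
    return labels
```

In a chain where A is similar to B and B is similar to C, A and C end up in the same cluster even if their direct similarity is low, which is the transitive behaviour described on the slide. With `min_pts=1` an isolated instance forms its own cluster; with a larger value it is labelled as noise.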

  18. Solution: Hierarchical Clustering
      • Hierarchical Agglomerative Clustering starts with each patent in a different cluster, and then merges clusters successively based on the best similarity values
      • The Stopping Criterion used is the threshold obtained from Logistic Regression
      • We employ Single-linkage clustering, which uses the best similarity value between clusters to merge them
      • When cluster similarities are less than the threshold, the merge process is stopped
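
The merge loop described on this slide can be sketched as follows. It is a deliberately simple illustration that recomputes all cluster-to-cluster similarities in every round, not an efficient implementation or the InvIdenti one; the function name is hypothetical and the input is assumed to be a symmetric matrix of global similarities.

```python
def single_linkage_clusters(similarity, threshold: float):
    """Agglomerative clustering with single linkage and a threshold stop.

    similarity[i][j] is the global similarity between instances i and j.
    Returns a list of clusters, each a list of instance indices.
    """
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                link = max(similarity[i][j]
                           for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, best_pair = link, (a, b)
        if best < threshold:
            break  # all remaining cluster similarities are below the threshold
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

For four instances where 0 and 1 have similarity 0.9, 1 and 2 have 0.8, 0 and 2 have only 0.3, and instance 3 has 0.1 to all others, a threshold of 0.7 yields the clusters `[0, 1, 2]` and `[3]`: instance 2 joins through its link to instance 1.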
