An Approach to Automated Thesaurus Construction Using Clusterization-Based Dictionary Analysis

  1. An Approach to Automated Thesaurus Construction Using Clusterization-Based Dictionary Analysis
     Nadezhda Lagutina
     P.G. Demidov Yaroslavl State University, Yaroslavl, Russia

  2. Thesaurus definition
     A thesaurus is a vocabulary of controlled indexing language, formally organized so that a priori relationships between concepts are made explicit. [J. Aitchison, A. Gilchrist, D. Bawden. Thesaurus construction and use: a practical manual]

  3. Thesaurus purpose
     - Indexing documents with concepts from semantic resources and improving the user's search results
     - Classifying documents and dividing them into clusters

  4. Thesaurus construction
     - Domain
     - Clusters
     - Terms
     - Hierarchical relations
     - Associative relations

  5. Disadvantages of manual thesaurus-making
     - High cost
     - Long duration
     - Limits of manual analysis on a large text corpus

  6. Automated approach for thesaurus construction
     - Preliminary processing of the text corpus
     - Automatic generation of a set of candidate terms
     - Expert correction of the set produced by the previous step
     - Automatic clustering of the terms
     - Expert evaluation of the clustering results
     - Expert establishment of the semantic relations between the terms

  7. Example of thesaurus construction
     - Domain: Cardiology
     - Dictionary: Online Stedman's Medical Dictionary
     - Key words: heart, -card-, valv-, vessel, trunk, vascular, vein, artery, aorta, atrium, ventric-, block, hypertension, hypotension

  8. Automatic generation of the candidate term list
     Key word search
     - by full words
     - by word morphemes
     Quantitative characteristics
     - number of key words (morphemes) in a dictionary entry (absolute frequency)
     - percentage of key words (morphemes) in a dictionary entry (relative frequency)
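The two quantitative characteristics above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the convention that a hyphen marks where a morpheme attaches, the exact-match rule for full words, and the names `KEYS`, `matches`, and `keyword_frequencies` are all assumptions. Only a subset of the slide's key words is used.

```python
import re

# Subset of the Cardiology key words. Assumed convention: "-card-" may occur
# anywhere in a word, "valv-" only at the start, plain entries match whole words.
KEYS = ["heart", "-card-", "valv-", "vessel", "vascular", "vein", "artery"]


def tokenize(text):
    """Split an entry into lowercase alphabetic words (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def matches(word, key):
    """Check one word against one key word or morpheme."""
    stem = key.strip("-")
    if key.startswith("-") and key.endswith("-"):
        return stem in word
    if key.endswith("-"):
        return word.startswith(stem)
    if key.startswith("-"):
        return word.endswith(stem)
    return word == stem


def keyword_frequencies(entry, keys=KEYS):
    """Return (absolute frequency, relative frequency in percent) for an entry."""
    words = tokenize(entry)
    hits = sum(1 for w in words if any(matches(w, k) for k in keys))
    relative = 100.0 * hits / len(words) if words else 0.0
    return hits, relative


entry = "trochocardia. A rotary displacement of the heart around its axis."
absolute, relative = keyword_frequencies(entry)
```

For the example entry, the head word matches the morpheme `-card-` and the definition contains the full word `heart`. How an entry is ranked or thresholded on these two values is not stated in the slides, so the sketch only computes them.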

  9. Candidate term list
     - Stedman's Medical Dictionary: 100 000 terms
     - Candidate term list: 2 039 terms
     - Example: trochocardia. A rotary displacement of the heart around its axis.

  10. Clustering
      The CLOPE algorithm [Y. Yang, X. Guan, J. You. CLOPE: a fast and effective clustering algorithm for transactional data]
      - $T = \{t_1, t_2, \ldots, t_n\}$: set of transactions
      - $t_i = \{x_{i1}, x_{i2}, \ldots\}$: dictionary entry
      - $x_{ij}$: word
      - Clustering $C = \{C_1, \ldots, C_k\}$
      $$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{|D(C_i)|^{r}} \times |C_i|}{\sum_{i=1}^{k} |C_i|} \to \max$$
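The profit criterion can be illustrated with a short sketch. It follows the CLOPE paper's definitions, in which S(C_i) is the total number of item occurrences in cluster C_i and D(C_i) is its set of distinct items. The greedy `clope_pass` below is a deliberately simplified single pass that recomputes the profit from scratch for every trial placement. The published algorithm uses incremental updates and repeats passes until no transaction moves. The function names and the default `r = 2.0` are assumptions for illustration, not values from the slides.

```python
from collections import Counter


def cluster_stats(cluster):
    """Return (S, |D|, N): item occurrences, distinct items, transactions."""
    occurrences = Counter(item for t in cluster for item in t)
    return sum(occurrences.values()), len(occurrences), len(cluster)


def profit(clusters, r=2.0):
    """Profit_r(C) = sum(S/|D|^r * |C_i|) / sum(|C_i|), skipping empty clusters."""
    numerator, denominator = 0.0, 0
    for cluster in clusters:
        if not cluster:
            continue
        s, width, n = cluster_stats(cluster)
        numerator += s / (width ** r) * n
        denominator += n
    return numerator / denominator if denominator else 0.0


def clope_pass(transactions, r=2.0):
    """Simplified greedy pass: place each transaction where profit is highest."""
    clusters = []
    for t in transactions:
        t = frozenset(t)
        best_profit, best_idx = None, None
        # Index len(clusters) stands for "open a new cluster".
        for idx in range(len(clusters) + 1):
            trial = [list(c) for c in clusters]
            if idx == len(clusters):
                trial.append([t])
            else:
                trial[idx].append(t)
            p = profit(trial, r)
            if best_profit is None or p > best_profit:
                best_profit, best_idx = p, idx
        if best_idx == len(clusters):
            clusters.append([t])
        else:
            clusters[best_idx].append(t)
    return clusters


# Toy transactions; in the slides each transaction is the word set of an entry.
entries = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}, {"d", "e"}, {"d", "e", "f"}]
clusters = clope_pass(entries, r=2.0)
```

A larger repulsion parameter `r` penalizes clusters with many distinct words more strongly, so it tends to produce more and tighter clusters. The value used in the reported experiment is not given in the slides.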

  11. Small cluster (< 10 terms)
      - dexiocardia / dextrocardia
      - pericardium / heart-sac
      - sphygmocardioscope / sphygmocardiograph
      - bradycardia / bradycardia / brachycardia / areocardia / araiocardia

  12. Larger cluster (> 50 terms)
      - [Standard anatomy of heart and blood vessels]: interatrial, interventricular, intravascular, intra-atrial, endocardium, intravenous, intramyocardial, periatrial, …
      - [Pathology of heart and blood vessels]: phlebocholosis, phlebectasia, vasoconstriction, cardiopalmus, cardiomegaly, capillarectasia, …
      - [Standard anatomy of heart and blood vessels]: angiogenesis, intra-auricular, …
      - [Tools and instruments]: phleborrhaphy, venesuture, …
      - [Pathology of heart and blood vessels]: telangiitis, cardiodynia, angiocarditis, omphalophlebitis

  13. Semantic clusters
      - Standard anatomy of heart and blood vessels
      - Standard physiology of heart and blood vessels
      - Pathology of heart and blood vessels
      - Tools and instruments
      - Pharmacology
      - Surgical intervention and manipulations

  14. Results
      - The proposed approach makes it possible to construct a thesaurus corpus that adequately represents the target domain
      - The clustering results simplify the expert's work because terms from the same area usually follow one another within the clusters
