An Approach to Automated Thesaurus Construction Using Clusterization-Based Dictionary Analysis
Nadezhda Lagutina
P.G. Demidov Yaroslavl State University
Yaroslavl, Russia
Thesaurus definition
A thesaurus is a vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts are made explicit. [J. Aitchison, A. Gilchrist, D. Bawden. Thesaurus construction and use: a practical manual]
Thesaurus purpose
Indexing of documents using concepts from semantic resources and enhancement of the user's search results
Classification and division of documents into clusters
Thesaurus construction
Domain
Clusters
Terms
Hierarchical relations
Associative relations
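The components listed on this slide can be pictured as a small data structure. The sketch below is only an illustration under assumptions: the class names `Term` and `Thesaurus`, the field names, and the helper `link_hierarchical` are hypothetical and do not come from the presentation, and the heart/endocardium link is an illustrative example rather than a relation taken from the author's thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus term with its relations (hypothetical layout)."""
    name: str
    broader: set[str] = field(default_factory=set)   # hierarchical: more general terms
    narrower: set[str] = field(default_factory=set)  # hierarchical: more specific terms
    related: set[str] = field(default_factory=set)   # associative relations


@dataclass
class Thesaurus:
    """A domain split into named clusters, each holding terms."""
    domain: str
    clusters: dict[str, dict[str, Term]] = field(default_factory=dict)

    def add_term(self, cluster: str, name: str) -> Term:
        # Create the cluster and the term on first use, then return the term.
        terms = self.clusters.setdefault(cluster, {})
        if name not in terms:
            terms[name] = Term(name)
        return terms[name]


def link_hierarchical(broader: Term, narrower: Term) -> None:
    """Record a hierarchical relation in both directions."""
    broader.narrower.add(narrower.name)
    narrower.broader.add(broader.name)


# Illustrative usage with a cluster name from the cardiology example.
thesaurus = Thesaurus(domain="Cardiology")
anatomy = "Standard anatomy of heart and blood vessels"
heart = thesaurus.add_term(anatomy, "heart")
endocardium = thesaurus.add_term(anatomy, "endocardium")
link_hierarchical(broader=heart, narrower=endocardium)
```

Storing relations by term name rather than by object reference keeps the structure easy to serialize, at the cost of a lookup when following a relation.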
Disadvantages of manual thesaurus construction
High cost
Long duration
Limits of manual analysis when the text corpus is large
Automated approach to thesaurus construction
Preliminary processing of the text corpus
Automatic generation of a set of candidate terms
Correction by the expert of the set produced in the previous step
Automatic clustering of the terms
Evaluation of the clustering results by the expert
Establishment of the semantic relations between the terms by the expert
Example of thesaurus construction
Domain: Cardiology
Dictionary: Online Stedman's Medical Dictionary
Key words: heart, -card-, valv-, vessel, trunk, vascular, vein, artery, aorta, atrium, ventric-, block, hypertension, hypotension
Automatic generation of the candidate term list
Key word search: by full words; by word morphemes
Quantitative characteristics:
number of key words (morphemes) in a dictionary entry (absolute frequency)
percentage of key words (morphemes) in a dictionary entry (relative frequency)
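The two quantitative characteristics on this slide can be illustrated with a short sketch. It rests on several assumptions that the presentation does not state: a hyphen in a key is read as "may continue into a longer word", frequency is counted per word of the entry (headword included), and the names `key_pattern` and `frequencies` are hypothetical. It is not the author's implementation.

```python
import re

# Key words and morphemes from the cardiology example.
KEYS = ["heart", "-card-", "valv-", "vessel", "trunk", "vascular", "vein",
        "artery", "aorta", "atrium", "ventric-", "block",
        "hypertension", "hypotension"]


def key_pattern(key: str) -> re.Pattern:
    """Full words match on word boundaries; a hyphen drops the boundary on that side."""
    core = re.escape(key.strip("-"))
    left = "" if key.startswith("-") else r"\b"
    right = "" if key.endswith("-") else r"\b"
    return re.compile(left + core + right, re.IGNORECASE)


PATTERNS = [key_pattern(k) for k in KEYS]


def frequencies(entry: str) -> tuple[int, float]:
    """Return (absolute, relative) key-word frequency for one dictionary entry."""
    words = re.findall(r"[A-Za-z]+", entry)
    # A word counts once if it matches any key word or morpheme.
    hits = sum(1 for w in words if any(p.search(w) for p in PATTERNS))
    relative = hits / len(words) if words else 0.0
    return hits, relative


entry = "trochocardia. A rotary displacement of the heart around its axis."
absolute, relative = frequencies(entry)
print(absolute, relative)
```

For this entry the matching words are "trochocardia" (morpheme -card-) and "heart", so the absolute frequency is 2 out of 10 words.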
Candidate term list
Stedman's Medical Dictionary: 100 000 terms
Candidate term list: 2 039 terms
Example: trochocardia. A rotary displacement of the heart around its axis.
Clustering
The CLOPE algorithm [Y. Yang, X. Guan, J. You. CLOPE: a fast and effective clustering algorithm for transactional data]
$T = \{t_1, t_2, \ldots, t_n\}$: set of transactions
$t_i = \{x_{i1}, x_{i2}, \ldots\}$: dictionary entry
$x_{ij}$: word
Clustering $C = \{C_1, \ldots, C_k\}$
$$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{|D(C_i)|^{r}} \times |C_i|}{\sum_{i=1}^{k} |C_i|} \to \max$$
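The profit criterion can be evaluated directly for a given clustering. In the CLOPE paper, $S(C_i)$ is the total number of word occurrences in cluster $C_i$, $D(C_i)$ is its set of distinct words, and $r$ is the repulsion parameter. The sketch below only scores a clustering; it does not perform CLOPE's iterative assignment of transactions. The function name `profit`, the default `r = 2.0`, and the toy entries are assumptions made for illustration.

```python
from collections import Counter


def profit(clusters, r=2.0):
    """Score a clustering: `clusters` is a list of clusters, each a list of word sets."""
    numerator = 0.0
    total = 0
    for cluster in clusters:
        occurrences = Counter(word for entry in cluster for word in entry)
        if not occurrences:
            continue  # skip empty clusters to avoid dividing by zero
        size = sum(occurrences.values())   # S(C_i): all word occurrences
        width = len(occurrences)           # |D(C_i)|: distinct words
        numerator += size / width ** r * len(cluster)
        total += len(cluster)              # |C_i|: number of entries
    return numerator / total if total else 0.0


# Toy dictionary entries reduced to word sets (illustrative data only).
t1 = {"heart", "right"}
t2 = {"heart", "left"}
t3 = {"vein", "suture"}

separated = [[t1, t2], [t3]]   # entries sharing a word grouped together
merged = [[t1, t2, t3]]        # everything in one cluster
print(profit(separated), profit(merged))
```

Grouping the two entries that share "heart" scores higher than merging all three, which is the behaviour the criterion is meant to reward: clusters with many shared words and few distinct ones.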
Small cluster (< 10 terms)
dexiocardia / dextrocardia
pericardium / heart-sac
sphygmocardioscope / sphygmocardiograph
bradycardia / brachycardia / areocardia / araiocardia
Larger cluster (> 50 terms)
[Standard anatomy of heart and blood vessels]: interatrial, interventricular, intravascular, intra-atrial, endocardium, intravenous, intramyocardial, periatrial, …
[Pathology of heart and blood vessels]: phlebocholosis, phlebectasia, vasoconstriction, cardiopalmus, cardiomegaly, capillarectasia, …
[Standard anatomy of heart and blood vessels]: angiogenesis, intra-auricular, …
[Tools and instruments]: phleborrhaphy, venesuture, …
[Pathology of heart and blood vessels]: telangiitis, cardiodynia, angiocarditis, omphalophlebitis
Semantic clusters
Standard anatomy of heart and blood vessels
Standard physiology of heart and blood vessels
Pathology of heart and blood vessels
Tools and instruments
Pharmacology
Surgical intervention and manipulations
Results
The proposed approach makes it possible to construct a thesaurus corpus that adequately represents the target domain
The clustering results simplify the expert's work because terms from the same area usually follow one another within the clusters