Automatic Identification of Document Translations in Large Multilingual Document Collections

RANLP Conference, Borovets, Bulgaria, 11 September 2003
Bruno Pouliquen, Ralf Steinberger & Camelia Ignat
Joint Research Centre, Ispra, Italy
http://www.jrc.it/langtech

RANLP'2003, Bulgaria, 11.09.03
EC - Joint Research Centre - IPSC --- Ralf Steinberger

In a Nutshell
[Figure: a Spanish text ("Resolución sobre los residuos radioactivos") and an English text ("Resolution on radioactive waste") linked to Eurovoc thesaurus descriptors, here displayed in English. Codes shown in the figure: 6621020304, 52160104.]
Agenda
- Who we are and what we do
- Eurovoc Thesaurus
- Automatic assignment of thesaurus descriptors to text
  - Training Phase
  - Assignment Phase
- Document Similarity Calculation and Translation Identification
- Application Areas of the Technology

Goal of JRC's Language Technology work
IDoRA System: Intelligent Document Retrieval and Analysis
- Retrieval of potentially relevant texts
- Text analysis and extraction of information from texts
- Visualisation of the contents
Focus of JRC's Language Technology work
- Multilingual and cross-lingual applications
- Also for languages of EU Candidate Countries
- Many languages; few human resources
- Applications using more statistics and less language-specific resources

Eurovoc Thesaurus
http://europa.eu.int/celex/eurovoc
- Multilingual list of terms about many different subject areas (wide coverage)
- Developed by the European Parliament (EP) and others
- Actively used to index (catalogue) and retrieve documents in large collections (fine-grained classification and cataloguing system)
- Hierarchically organised into a maximum of 8 levels
  - top level: 21 fields
  - next level: 127 micro-thesauri
  - total: 5933 descriptors (version 3.0)
- 5877 reciprocal relations (BT, NT)
- 2730 reciprocal associations (RT)
Eurovoc (Top Level and Detail)

Top level:
  04 Politics
  08 International Relations
  10 European Communities
  12 Law
  16 Economics
  20 Trade
  24 Finance
  28 Social Questions
  32 Education and Communications
  36 Science
  40 Business and Competition
  44 Employment and Working Conditions
  48 Transport
  52 Environment
  56 Agriculture, Forestry and Fisheries
  60 Agri-Foodstuffs
  64 Production, Technology and Research
  66 Energy
  68 Industry
  72 Geography
  76 International Organisations

Detail:
  28 SOCIAL QUESTIONS
    2806 family
    2811 migration
    2816 demography and population
    2821 social framework
    2826 social affairs
    2831 culture and religion
      arts
      cultural policy
      culture
        acculturation
        civilization
        cultural difference
        cultural identity
          RT: protection of minorities (1236)
          RT: socio-cultural group (2821)
        cultural pluralism
        popular culture
        regional culture
      religion
    2836 social protection
    2841 health
    2846 construction and town planning

Eurovoc Users
Documentation Centres and Libraries of:
- European Parliament
- DG OPOCE
- Belgium:
  - Senate
  - La Chambre
- Portugal: Assembleia da República
- Sweden: Riksdag
- Spain:
  - El Senado
  - Congreso de los Diputados
- Switzerland: Assemblée Fédérale
- Czech Republic:
  - Chamber of Deputies
  - Euro Info Centre
  - European Documentation Centre
  - Info Centre of the EU
  - Supreme Audit Office
  - Parliamentary Library
- Lithuanian Seimas
- Polish Sejm
- Slovenian Državni zbor
- Romanian Camera Deputatilor
- Russian Duma
- Albanian Parliament
- Croatia
- Ukraine
Eurovoc Languages
- Used by the EP and DG OPOCE for all 11 official EU languages
- Also exists for: Albanian, Czech, Croatian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Russian, Slovak, Slovenian
- Considering the use of Eurovoc: Armenia, Bosnia-Herzegovina, Bulgaria, Estonia, France, Georgia, Iceland, Macedonia, Turkey
- Most multilingual thesaurus in existence? (currently 22 languages)

Automatic Indexing: Challenge
- Descriptors are mostly abstract multi-word concepts, e.g.
  - PROTECTION OF MINORITIES
  - FISHERY MANAGEMENT
  - CONSTRUCTION AND TOWN PLANNING
  - SIMPLIFICATION OF FORMALITIES
  - PLUTONIUM
  - FRANCE
- Searching for the descriptors themselves in the text (baseline) is not a solution: maximum recall ~ 30%, maximum precision ~ 7%
- Keyword assignment as opposed to keyword extraction
JRC's Statistical / Associative Approach
Example descriptor: FISHERY MANAGEMENT
1. Training phase: identify many (statistically or semantically) related words (associates).
2. Assignment phase: assign the descriptor if many of its associates are present in the text.

Training: Text Normalisation
- Linguistic pre-processing = normalisation of the text
- Lemmatisation (base-form reduction of words) and lower-casing:
  - Transporting → transport
- Mark-up of multi-word expressions:
  - 'plant' → 'green_plant' vs. 'power_plant'
- Stop word lists to avoid words that are not content-bearing:
  - general: are, they, having, in_spite_of, interesting
  - domain-specific: question, answer, commission, article
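The normalisation steps listed on the slide can be sketched roughly as follows. This is only a minimal illustration: the lemma table, the multi-word table and the function name `normalise` are hypothetical stand-ins, since the slides do not describe the lexical resources the JRC system actually uses. The stop words are the examples given on the slide.

```python
import re

# Toy resources for illustration only (assumptions, not the JRC lexicons).
LEMMAS = {"transporting": "transport", "plants": "plant"}
MULTI_WORD = {
    ("green", "plant"): "green_plant",
    ("power", "plant"): "power_plant",
    ("in", "spite", "of"): "in_spite_of",
}
# Stop word examples taken from the slide (general and domain-specific).
STOP_WORDS = {
    "are", "they", "having", "in_spite_of", "interesting",
    "question", "answer", "commission", "article",
}


def normalise(text):
    """Lower-case, lemmatise, mark up multi-word expressions, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [LEMMAS.get(tok, tok) for tok in tokens]

    marked = []
    i = 0
    while i < len(tokens):
        # Try the longest multi-word expression first (3 tokens, then 2).
        for n in (3, 2):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in MULTI_WORD:
                marked.append(MULTI_WORD[key])
                i += n
                break
        else:
            # No multi-word expression starts here: keep the single token.
            marked.append(tokens[i])
            i += 1

    return [tok for tok in marked if tok not in STOP_WORDS]


result = normalise("They are transporting green plants")
```

With these toy tables, `result` contains the lemma `transport` and the marked-up expression `green_plant`; the stop words `they` and `are` are dropped.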
Training: Produce Associate Lists
- Using a large collection of manually indexed documents (training corpus)
- For each descriptor D1, take all documents indexed with D1
  - identify the statistically salient words in each of these texts
  - join these lists of statistically salient words, e.g. for RADIOACTIVE MATERIALS:

    Text 1          Text 2          Text 3              Joined list
    radioactive     plutonium       Illegal_traffic     radioactive (3)
    ukraine         deuterium       chernobyl           plutonium (3)
    resolution      assembly        radioactive         nuclear (2)
    plutonium    +  nuclear      +  ukrainian        =  deuterium (2)
    deuterium       schmidt         plutonium           Illegal_traffic (1)
    parliament      radioactive     lithium             chernobyl (1)
    nuclear         korea           dangerous           ...
    blottnitz       iaea            mox
    ...             ...             ...

- Normalise the weight according to a number of different criteria.
- Result of training: weighted associate lists for all descriptors

Associate List: RADIOACTIVE MATERIALS
[Figure: associate list for the descriptor RADIOACTIVE MATERIALS]
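The joining step of the training phase can be sketched as below, using the three example word lists from the RADIOACTIVE MATERIALS slide. The function name `build_associate_lists` and the input layout are assumptions made for this sketch. It takes the salient words of each document as given and only counts in how many documents each word occurs; the salience test and the later weight normalisation are not specified on the slides and are left out.

```python
from collections import Counter, defaultdict


def build_associate_lists(indexed_docs):
    """Join per-document salient-word lists into one counter per descriptor.

    indexed_docs: iterable of (descriptors, salient_words) pairs, where
    descriptors are the manually assigned thesaurus descriptors of a document.
    """
    associates = defaultdict(Counter)
    for descriptors, salient_words in indexed_docs:
        for descriptor in descriptors:
            # Count each word once per document, as in the slide example.
            associates[descriptor].update(set(salient_words))
    return associates


# The three example texts from the slide, all indexed with the same descriptor.
training_corpus = [
    (["RADIOACTIVE MATERIALS"],
     ["radioactive", "ukraine", "resolution", "plutonium",
      "deuterium", "parliament", "nuclear", "blottnitz"]),
    (["RADIOACTIVE MATERIALS"],
     ["plutonium", "deuterium", "assembly", "nuclear",
      "schmidt", "radioactive", "korea", "iaea"]),
    (["RADIOACTIVE MATERIALS"],
     ["Illegal_traffic", "chernobyl", "radioactive", "ukrainian",
      "plutonium", "lithium", "dangerous", "mox"]),
]

associate_lists = build_associate_lists(training_corpus)
radioactive_materials = associate_lists["RADIOACTIVE MATERIALS"]
```

For this input the counts match the joined list on the slide: `radioactive` and `plutonium` occur in three texts, `nuclear` and `deuterium` in two, `Illegal_traffic` and `chernobyl` in one.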
Associate List: FISHERY MANAGEMENT
[Figure: associate list for the descriptor FISHERY MANAGEMENT, with fishery-related and management-related associates marked]

Assignment Phase
- Normalise new document (lemmatise, multi-word mark-up)
- Produce lemma frequency list (excluding stop words)
- Calculate similarity between lemma frequency list and descriptor associate lists, using statistical formulae, e.g.

$$\mathrm{COSINE}(d,t) = \frac{\sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}}{\sqrt{\left(\sum_{l \in d} \mathrm{TFIDF}_{l,d}^{2}\right) \cdot \left(\sum_{l \in t} \mathrm{TFIDF}_{l,t}^{2}\right)}}$$
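A minimal sketch of the similarity step of the assignment phase follows, assuming the TF.IDF weights of the document lemmas and of each descriptor's associates have already been computed and are stored as plain dictionaries. The names `cosine` and `rank_descriptors`, the `top_n` parameter and the toy weights are illustrative assumptions, not part of the original system.

```python
import math


def cosine(descriptor_weights, text_weights):
    """Cosine between a descriptor's associate list (d) and a text (t).

    Both arguments map lemmas to TF.IDF weights.
    """
    shared = descriptor_weights.keys() & text_weights.keys()
    dot = sum(descriptor_weights[l] * text_weights[l] for l in shared)
    norm = math.sqrt(
        sum(w * w for w in descriptor_weights.values())
        * sum(w * w for w in text_weights.values())
    )
    return dot / norm if norm else 0.0


def rank_descriptors(text_weights, associate_lists, top_n=5):
    """Return the top_n (descriptor, score) pairs, most similar first."""
    scores = {
        descriptor: cosine(weights, text_weights)
        for descriptor, weights in associate_lists.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Toy weights for illustration only.
associate_lists = {
    "RADIOACTIVE MATERIALS": {"plutonium": 3.0, "nuclear": 2.0},
    "FISHERY MANAGEMENT": {"fishery": 4.0, "quota": 1.0},
}
text_weights = {"plutonium": 2.0, "waste": 1.0}

ranking = rank_descriptors(text_weights, associate_lists)
```

In this toy case the text shares a lemma only with RADIOACTIVE MATERIALS, so that descriptor is ranked first and FISHERY MANAGEMENT receives a score of zero.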
Formulae tested for descriptor assignment

- TF.IDF (Term Frequency, Inverse Document Frequency): considers the occurrence frequency of lemma (l) in the meta-text ($\mathrm{TF}_{l,t}$) and the number of descriptors (d) for which the lemma is an associate ($\mathrm{DF}_l$).

$$\mathrm{TFIDF}_{l,d} = \mathrm{TF}_{l,d} \cdot \left( \log_2 \frac{N}{\mathrm{DF}_l} + 1 \right)$$

- Cosine: uses TF.IDF; computes the angle of two multi-dimensional vectors (of the document (t) and of the descriptor associate list).

$$\mathrm{COSINE}(d,t) = \frac{\sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}}{\sqrt{\left(\sum_{l \in d} \mathrm{TFIDF}_{l,d}^{2}\right) \cdot \left(\sum_{l \in t} \mathrm{TFIDF}_{l,t}^{2}\right)}}$$

- Okapi: considers the occurrence frequency of the lemma as an associate ($\mathrm{DF}_l$); the number of associates in the associate list (size, $|d|$); the average size of descriptor associate lists ($M$); the total number of descriptors used ($N$).

$$\mathrm{Okapi}_{t,d} = \sum_{l \in t \cap d} \frac{\mathrm{TF}_{l,d}}{\mathrm{TF}_{l,d} + \frac{|d|}{M}} \cdot \log \frac{N - \mathrm{DF}_l}{\mathrm{DF}_l}$$

- 'Scalar Product': adds the products of the TF.IDF values of associates and text lemmas.

$$\mathrm{Sproduct}(d,t) = \sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}$$

- '622' mixed formula: uses all of the above.

$$\Phi = 0.61 \cdot \frac{\mathrm{COSINE}}{\max(\mathrm{COSINE})} + 0.21 \cdot \frac{\mathrm{Okapi}}{\max(\mathrm{Okapi})} + 0.18 \cdot \frac{\mathrm{Sproduct}}{\max(\mathrm{Sproduct})}$$

Manual Evaluation of the Assignment
[Figure: manual evaluation of the assignment]
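As a rough illustration of the formulae tested for descriptor assignment, the sketch below implements the TF.IDF weight, the Okapi score, the scalar product and the '622' mixed formula. Several points are assumptions not confirmed by the slides: the natural logarithm in Okapi, taking each max(...) over all candidate descriptors for one text, and all function and parameter names.

```python
import math


def tfidf(tf, df, n):
    """TF.IDF weight: tf * (log2(n / df) + 1), with n descriptors in total."""
    return tf * (math.log2(n / df) + 1)


def okapi(text_lemmas, assoc_tf, df, n, avg_size):
    """Okapi score of one descriptor for one text.

    text_lemmas: set of lemmas in the text
    assoc_tf:    lemma -> frequency in the descriptor's associate list
    df:          lemma -> number of descriptors having the lemma as associate
    n, avg_size: number of descriptors and average associate list size (M)
    Assumes df[l] < n for every shared lemma and a natural logarithm.
    """
    size = len(assoc_tf)  # |d|
    return sum(
        assoc_tf[l] / (assoc_tf[l] + size / avg_size)
        * math.log((n - df[l]) / df[l])
        for l in text_lemmas
        if l in assoc_tf
    )


def sproduct(descriptor_weights, text_weights):
    """Sum of products of TF.IDF weights over the shared lemmas."""
    shared = descriptor_weights.keys() & text_weights.keys()
    return sum(descriptor_weights[l] * text_weights[l] for l in shared)


def mixed_622(cosine_scores, okapi_scores, sproduct_scores):
    """Combine three descriptor -> score dicts with weights 0.61/0.21/0.18.

    Each score is divided by the maximum of its formula over all candidate
    descriptors (an assumption about what max(...) ranges over).
    """
    def scaled(scores):
        top = max(scores.values(), default=0.0)
        return {d: (s / top if top else 0.0) for d, s in scores.items()}

    cos, oka, spr = (scaled(s) for s in
                     (cosine_scores, okapi_scores, sproduct_scores))
    return {d: 0.61 * cos[d] + 0.21 * oka[d] + 0.18 * spr[d] for d in cos}


# Toy scores for two hypothetical descriptors X and Y.
phi = mixed_622({"X": 0.5, "Y": 0.25}, {"X": 2.0, "Y": 4.0},
                {"X": 10.0, "Y": 5.0})
```

Because every formula is divided by its own maximum before weighting, the three components are on a comparable 0 to 1 scale, and the weights 0.61, 0.21 and 0.18 sum to 1.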