Term and Collocation Extraction by means of complex Linguistic Web Services
Ulrich Heid, Fabienne Fritzinger, Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow
Institut für maschinelle Sprachverarbeitung, Universität Stuttgart and Seminar für Sprachwissenschaft, Universität Tübingen, Germany
Linguistic Resources and Evaluation Conference, 2010: Valletta, Malta
Heid et al. (Stuttgart/Tübingen) D-SPIN Extraction WebServices LREC 2010 1 / 16

Overview
• Objectives and scenarios addressed
• Data used for experimentation
• Procedures to extract single word term candidates
• Procedures to extract collocation candidates
• Combining the tools for both extraction tasks
• The extraction as a web service: Architecture – technical issues addressed – open questions
• Conclusion – Future Work

Objectives
• Provision of computational linguistic tools for
  • Term candidate extraction
  • Collocation candidate extraction
  • Extraction of regionalism candidates
• Tools based on standard corpus processing techniques: Tagging – parsing – pattern-based extraction – lexicostatistics
• Tools wrapped and provided as chains of web services:
  • to assess possibilities of creating complex linguistic web services
  • to test the processing of non-trivial amounts of data via web services

Scenarios addressed
• Type I: single word term candidate extraction
  • to find specialized terms of a specific domain of knowledge
  • to find lexical material specific to a given region: German of Germany – Austria – Switzerland – South Tyrol
• Type II: extraction of multiword expressions (MWEs)
  • to find collocations (cf. Weller & Heid, this session)
  • to find multiword terms and phraseology of specialized domains
  • to find collocations typical of a “region” (D – A – CH – ST)

Data used in the experiments
Work on German texts
• General language: newspaper texts
  • Frankfurter Rundschau (1992/1993): 40 M
  • Frankfurter Allgemeine Zeitung (1995 - 1998): 78 M
  • Die Zeit (1999 - 2005): 50 M
  • Stuttgarter Zeitung (1992/1993): 36 M
  • Handelsblatt (1995 - 1998): 50 M
  • total newspapers: ca. 254 M
• Specialized language (taken from the OPUS website):
  • European Medicines Agency (EMEA), pharmaceuticals tests: 10 M
• National or regional variants of German:
  • Austria (excerpts from the DeReKo corpus of IdS Mannheim): 180 M
  • Switzerland (ditto: DeReKo): 180 M
  • South Tyrol (Eurac/Athesia publishers): ca. 60 M

Procedures for single word term candidate extraction
Based on relative frequency relationships: “Weirdness scores” (Ahmad et al. 1992)
• Intuition: Terms from a domain are more frequent in domain-specific texts than elsewhere
• Calculation: for each noun, verb, adjective from the specialized text:
  • RS: relative frequency in the specialized text: number of occurrences / corpus size (by POS) of the specialized text
  • RG: relative frequency of the same item in general language text (newspapers taken to be without bias for a given domain)
  • Relationship (ratio) RS/RG
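
The calculation on this slide can be illustrated with a minimal Python sketch. It is not the tool presented in the talk: the function names, the `(lemma, POS)` input format, the tag labels `NOUN`/`VERB`/`ADJ`, and the handling of items unseen in the general corpus (scored as infinity) are all assumptions made for illustration. Relative frequencies are computed per POS, as the slide indicates ("corpus size (by POS)").

```python
from collections import Counter


def relative_frequencies(tagged_tokens):
    """Map each (lemma, pos) pair to occurrences / number of tokens with that POS."""
    counts = Counter(tagged_tokens)
    pos_totals = Counter(pos for _, pos in tagged_tokens)
    return {item: n / pos_totals[item[1]] for item, n in counts.items()}


def weirdness(specialized, general, pos_tags=("NOUN", "VERB", "ADJ")):
    """Rank items of the specialized corpus by RS/RG, highest ratio first.

    specialized, general: lists of (lemma, pos) tuples (hypothetical format).
    """
    rs = relative_frequencies(specialized)
    rg = relative_frequencies(general)
    scores = {}
    for item, rs_value in rs.items():
        if item[1] not in pos_tags:
            continue  # only nouns, verbs and adjectives are scored
        rg_value = rg.get(item)
        # Assumption: an item absent from the general corpus gets an
        # infinite score; the slides do not say how this case is treated.
        scores[item] = rs_value / rg_value if rg_value else float("inf")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    specialized = [("Tablette", "NOUN"), ("Tablette", "NOUN"),
                   ("Haus", "NOUN"), ("einnehmen", "VERB")]
    general = [("Tablette", "NOUN"), ("Haus", "NOUN"), ("Haus", "NOUN"),
               ("Haus", "NOUN"), ("gehen", "VERB")]
    for (lemma, pos), score in weirdness(specialized, general):
        print(lemma, pos, score)
```

In this toy input, "Tablette" is relatively more frequent among the nouns of the specialized text than among the nouns of the general text, so its ratio exceeds 1, while "Haus" falls below 1. A real setup would likely add a minimum-frequency threshold or smoothing to avoid over-ranking rare items; that choice is outside what the slides specify.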
