Taal- en spraaktechnologie (Language and Speech Technology)
Sophia Katrenko, Utrecht University, the Netherlands
Lecture 1

Outline
Today:
- Collocations in linguistics
- Automatic collocation extraction

Today we discuss Chapter 5 of Manning and Schütze, “Foundations of Statistical Natural Language Processing”, and more precisely look at how to use probability theory machinery for NLP research (e.g., mutual information for detecting collocations).

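As a rough illustration of using mutual information to detect collocations, the sketch below computes pointwise mutual information, PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), for adjacent word pairs. The function name pmi_scores, the min_count parameter and the toy corpus are assumptions made for this example; they do not come from the lecture or the textbook, and probabilities are plain relative frequencies.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=1):
    """Return PMI for each adjacent word pair seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)               # number of word positions
    n_bi = max(len(tokens) - 1, 1)    # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus, invented for illustration only.
tokens = "the red herring and the red tape and the herring".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

PMI tends to favour rare pairs, so on real data a higher min_count (or a different association measure) is usually worth trying.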
Terminology
What is a collocation in linguistics?
A collocation is a recurrent, relatively fixed word combination (S. Bartsch, 2004). E.g., red herring, kick the bucket, dark-night, dog-bark.
Firth (1951): I propose to bring forward as a technical term, meaning by ‘collocation’, and to apply the test of ‘collocability’.
Jespersen (1917): Little and few are also incomplete negatives: note the frequent collocation with no: there is little or no danger.

Terminology
Firth: Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words.
Firth (1957): One of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night.

Terminology
Choueka (1988): A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components.

Terminology
Two main views on collocations:
- Firth: collocations as lexical proximities in text
- Choueka: collocations as syntactic and semantic units, semantic irregularity

Terminology
Halliday (1966):
- a collocation defines the membership of lexical sets
- e.g., he studies the differences between strong and powerful using the examples strong tea, strong car (not acceptable), and strong or powerful argument
- he argues that various grammatical configurations are possible, e.g. “he argued strongly against”, “the strength of his argument”, and others

Terminology
Further, we distinguish between
- strong collocators (blond is used to describe hair color) and weak collocators (e.g., the is used with various nouns)
- idioms (in English literature) or phraseological units (in German literature), e.g., black ingratitude
as well as relations expressing
- synonyms (honest/fair)
- antonyms (old/new)
- homonyms (i.e., words that share the same spelling and pronunciation, but not origin, e.g., bank)

Terminology
Criteria for collocations
- Non-compositionality: The meaning of a collocation is not a straightforward composition of the meanings of its parts (especially in the case of idioms).
- Non-substitutability: It is not possible to substitute the components of a collocation by their near-synonyms (e.g., orange tape instead of red tape; gray elephant instead of white elephant).
- Non-modifiability: Many collocations cannot be freely modified with additional lexical material or through grammatical transformations (e.g., kick the heavy bucket instead of kick the bucket).

Terminology
S. Evert: However,
- collocations range from completely fixed to syntactically flexible constructions.
- syntactic restrictions usually coincide with semantic restrictions and thus are indicators for the degree of lexicalization of a particular word combination.
- the specific restrictions associated with particular word combinations cannot be inferred from standard rules of grammar and thus need to be stored together with the collocation.

Terminology
Collocations can be
- word level phenomena (e.g., Red Cross, fix und fertig)
- phrase level phenomena (collocation phrases), e.g., copula constructions, proverbs
Collocation phrases consist of the lexically determined words (collocates) only, or contain additional lexically underspecified material.

Terminology
- Structural dependency: the collocates of a collocation are syntactic dependents, thus knowledge of syntactic structure is a precondition for accurate collocation identification.
- Syntactic context: may help to discriminate literal and collocational readings (think, e.g., of red tape).

Terminology
Collocation subclasses
- verb particle constructions (e.g., look up)
- light verbs: verbs like do or make in do a favor or make a decision
- proper nouns: considered collocations in computational linguistics (e.g., New York)
- terminological expressions: phrases in a particular domain, e.g., hydraulic oil filter

Terminology
How to recognize a collocation?
- Translation test: if we cannot translate a phrase word by word, then it is likely to be a collocation: make a decision, break a record.
- Association: a more general notion, which does not necessarily encompass grammatically bound elements: plane - airport.

Recent work
Computational work on extracting collocations automatically has been very popular:
- work by Stefan Evert, incl. his PhD dissertation “The Statistics of Word Cooccurrences: Word Pairs and Collocations” (2004) (check http://www.collocations.de/EK/).
- multilingual collocation extraction using syntax, as in Seretan, Violeta (in press). Syntax-Based Collocation Extraction. Springer. 2011.
Why is it still challenging?
- collocations are not always adjacent
- collocations pose a problem for machine translation

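To make the point about non-adjacent collocations concrete, here is a minimal sketch that counts ordered word pairs within a fixed window instead of only adjacent bigrams. The function window_pairs, its window parameter and the example sentence are assumptions made for illustration; a fixed window is only one simple way to reach non-adjacent collocates and ignores syntactic structure.

```python
from collections import Counter

def window_pairs(tokens, window=3):
    """Count ordered pairs (w1, w2) where w2 follows w1 within `window` tokens."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # Look at up to `window` tokens to the right of position i.
        for w2 in tokens[i + 1 : i + 1 + window]:
            counts[(w1, w2)] += 1
    return counts

# Invented example: "made" and "decision" are separated by three words.
tokens = "she made a very difficult decision".split()
counts = window_pairs(tokens, window=4)
adjacent_only = window_pairs(tokens, window=1)

print(counts[("made", "decision")])
print(adjacent_only[("made", "decision")])
```

With window=1 the function reduces to plain bigram counting, so the pair made ... decision is only found once the window is wide enough.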
Procedure
According to S. Evert, the extraction of collocations involves the following steps:
- Use an application-dependent notion of collocation, as opposed to Firth/Choueka.
- Extract collocations from relational data (relational n-grams).
- Consider grammatically homogeneous data (e.g. Adj-N, PP-V).
- Use recurrence/cooccurrence frequency as the main criterion for collocation extraction (statistical approaches).

Procedure
Depending on what type of collocations we want to extract (e.g., of the type subj-verb), we may need to do the following preprocessing steps:
- tokenization (orthographic words)
- POS tagging
- morphological analysis / lemmatization
- partial parsing (full parsing)

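The first and third preprocessing steps can be sketched as follows. This is a deliberately simplified illustration: the regular expression only handles lowercase ASCII words, and TOY_LEMMAS is a hypothetical hand-written lookup table standing in for a real morphological analyser. POS tagging and parsing are not shown because they need a trained model.

```python
import re

# Hypothetical toy lemma table; a real system would use a morphological analyser.
TOY_LEMMAS = {
    "made": "make",
    "decisions": "decision",
    "broke": "break",
    "records": "record",
}

def tokenize(text):
    """Split text into lowercase orthographic words, keeping internal hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def lemmatize(tokens, lemmas=TOY_LEMMAS):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [lemmas.get(tok, tok) for tok in tokens]

tokens = tokenize("She made two decisions; he broke records.")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

Lemmatizing before counting lets make a decision and made two decisions contribute to the same candidate pair.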
Procedure
Candidate extraction strategies:
- Strategy 1: Retrieval of n-grams from word forms only.
- Strategy 2: Retrieval of n-grams from part-of-speech annotated word forms.
- Strategy 3: Retrieval of n-grams from word forms with particular parts-of-speech, at particular positions in syntactic structure.

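Strategies 1 and 2 can be sketched on a tiny hand-tagged example. The tag names, the function names and the Adj-N pattern below are assumptions for illustration; the tags were assigned by hand rather than by a tagger. Strategy 3 is not shown, since it requires a parser to supply positions in syntactic structure.

```python
from collections import Counter

# Hand-tagged toy input; the tags are assumed, not produced by a real tagger.
tagged = [
    ("the", "DET"), ("strong", "ADJ"), ("tea", "NOUN"),
    ("and", "CONJ"),
    ("the", "DET"), ("strong", "ADJ"), ("argument", "NOUN"),
]

def word_bigrams(tagged_tokens):
    """Strategy 1: bigram candidates from word forms only."""
    words = [word for word, _ in tagged_tokens]
    return Counter(zip(words, words[1:]))

def pos_filtered_bigrams(tagged_tokens, pattern=("ADJ", "NOUN")):
    """Strategy 2: bigram candidates whose POS tags match `pattern`."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) == pattern:
            counts[(w1, w2)] += 1
    return counts

all_bigrams = word_bigrams(tagged)
adj_noun = pos_filtered_bigrams(tagged)
print(len(all_bigrams), "candidates from word forms only")
print(sorted(adj_noun), "candidates matching Adj-N")
```

Filtering by POS pattern keeps the candidate set grammatically homogeneous and drops pairs such as the strong that are frequent but uninteresting.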