Using WordNet to Supplement Corpus Statistics
Rose Hoberman and Roni Rosenfeld
November 14, 2002
Sphinx Lunch, Nov 2002
Data, Statistics, and Sparsity
• Statistical approaches need large amounts of data
• Even with lots of data, there is a long tail of infrequent events (in 100MW, over half of word types occur only once or twice; see the sketch below)
• Problem: poor statistical estimation of rare events
• Proposed solution: augment data with linguistic or semantic knowledge (e.g. dictionaries, thesauri, knowledge bases, ...)
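A quick way to check the type-frequency claim on any corpus; a minimal sketch only, where the file name and whitespace tokenization are assumptions, not the talk's actual setup:

```python
from collections import Counter

def rare_type_fraction(path):
    """Fraction of word types occurring only once or twice in a corpus,
    assuming one whitespace-tokenized sentence or document per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    rare = sum(1 for c in counts.values() if c <= 2)
    return rare / len(counts)

# Hypothetical usage:
# print(rare_type_fraction("broadcast_news_100mw.txt"))
```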
WordNet
• Large semantic network that groups words into synonym sets (synsets)
• Links synsets with a variety of linguistic and semantic relations
• Hand-built by linguists (based on theories of human lexical memory)
• Small sense-tagged corpus
WordNet: Size and Shape
• Size: 110K synsets, lexicalized by 140K lexical entries
  – 70% nouns
  – 17% adjectives
  – 10% verbs
  – 3% adverbs
• Relations: 150K
  – 60% hypernym/hyponym (IS-A)
  – 30% similar-to (adjectives), member-of, part-of, antonym
  – 10% ...
WordNet Example: Paper IS-A ...
• paper → material, stuff → substance, matter → physical object → entity
• composition, paper, report, theme → essay → writing ... abstraction → assignment ... work ... human act
• newspaper, paper → print media ... instrumentality → artifact → entity
• newspaper, paper, newspaper publisher → publisher, publishing house → firm, house, business firm → business, concern → enterprise → organization → social group → group, grouping
• ...
This Talk
• Derive numerical word similarities from the WordNet noun taxonomy
• Examine the usefulness of WordNet for two language modelling tasks:
  1. Improve the perplexity of a bigram LM (trained on very little data)
     – Combine bigram data of rare words with similar but more common proxies
     – Use WordNet to find similar words
  2. Find words which tend to co-occur within a sentence
     – Long-distance correlations are often semantic
     – Use WordNet to find semantically related words
Measuring Similarity in a Taxonomy
• The structure of a taxonomy lends itself to calculating distances (or similarities)
• Simplest distance measure: length of the shortest path (in edges)
• Problem: edges often span very different semantic distances
• For example (see the sketch below):
  plankton IS-A living thing
  rabbit IS-A leporid ... IS-A mammal IS-A vertebrate IS-A ... animal IS-A living thing
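A minimal sketch of the shortest-path measure over a toy fragment of the noun hierarchy; the edges below are illustrative, not WordNet's actual graph:

```python
from collections import deque

# Toy IS-A edges, child -> parent (illustrative only).
HYPERNYM = {
    "plankton": "living_thing",
    "rabbit": "leporid", "leporid": "lagomorph", "lagomorph": "mammal",
    "mammal": "vertebrate", "vertebrate": "chordate", "chordate": "animal",
    "animal": "living_thing",
}

def shortest_path_length(a, b):
    """Length (in edges) of the shortest path between two nodes,
    treating IS-A links as undirected."""
    adj = {}
    for child, parent in HYPERNYM.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

# Both words are "living things", but the path lengths differ wildly:
print(shortest_path_length("plankton", "living_thing"))  # 1
print(shortest_path_length("rabbit", "living_thing"))    # 7
```

The rabbit chain crosses many fine-grained edges while the plankton chain crosses one coarse edge, which is exactly the problem raised above.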
Measuring Similarity using Information Content
• Resnik's method: use structure and corpus statistics
• Counts from a corpus ⇒ probability of each concept in the taxonomy ⇒ "information content" of a concept
• Similarity between concepts = the information content of their least common ancestor (sketched below):
  sim(c1, c2) = −log p(lca(c1, c2))
• Other similarity measures subsequently proposed
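A sketch of Resnik's measure on a toy taxonomy with made-up concept counts; the counts and tree below are assumptions for illustration, whereas in the talk the probabilities come from corpus counts propagated up the WordNet noun hierarchy:

```python
import math

# Toy IS-A tree (child -> parent) and corpus concept counts; a concept's
# count includes everything below it, so the root holds all the mass.
PARENT = {"rabbit": "mammal", "mammal": "animal", "animal": "living_thing",
          "plankton": "living_thing", "living_thing": None}
COUNT  = {"rabbit": 50, "mammal": 400, "animal": 900,
          "plankton": 10, "living_thing": 1000}
TOTAL  = COUNT["living_thing"]

def info_content(concept):
    """Information content: -log p(concept)."""
    return -math.log(COUNT[concept] / TOTAL)

def ancestors(concept):
    """The concept plus all of its ancestors, bottom-up."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def resnik_sim(c1, c2):
    """sim(c1, c2) = information content of the least common ancestor."""
    anc1 = set(ancestors(c1))
    lca = next(a for a in ancestors(c2) if a in anc1)
    return info_content(lca)

print(resnik_sim("rabbit", "mammal"))    # lca = mammal, IC ~ 0.92
print(resnik_sim("rabbit", "plankton"))  # lca = living_thing (root), IC = 0.0
```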
Similarity between Words
• Each word has many senses (multiple nodes in the taxonomy)
• Resnik's word similarity: the maximum similarity between any pair of their senses
• Alternative definition: the weighted sum of sim(c1, c2) over all pairs of senses c1 of w1 and c2 of w2, where more frequent senses are weighted more heavily (both combinations are sketched below)
• For example: TURKEY vs. CHICKEN, TURKEY vs. GREECE
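A sketch of the two word-level combinations, assuming a concept-level similarity such as resnik_sim above and, for the weighted variant, per-word sense frequencies; the exact weighting scheme here is an assumption, not necessarily the one used in the talk:

```python
def word_sim_max(senses1, senses2, sim):
    """Resnik-style word similarity: best match over all sense pairs."""
    return max(sim(c1, c2) for c1 in senses1 for c2 in senses2)

def word_sim_weighted(senses1, senses2, sim, p1, p2):
    """Weighted variant: sum over all sense pairs, where p1[c] and p2[c]
    are relative sense frequencies, so frequent senses count for more."""
    return sum(p1[c1] * p2[c2] * sim(c1, c2)
               for c1 in senses1 for c2 in senses2)
```

Under the max measure, TURKEY scores high against both CHICKEN (via the bird senses) and GREECE (via the country senses); the weighted variant discounts pairings that involve rare senses.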
Improving Bigram Perplexity
• Combat sparseness → define equivalence classes and pool data
• Automatic clustering, distributional similarity, ...
• But... for rare words there is not enough information to cluster reliably
• Test whether bigram distributions of semantically similar words (according to WordNet) can be combined to reduce the bigram perplexity of rare words
Combining Bigram Distributions
• Simple linear interpolation of the target word's bigram distribution with its proxy's (sketched below):
  p_s(·|t) = (1 − λ) p_gt(·|t) + λ p_ml(·|s)
  where t is the rare target word, s the proxy, p_gt the baseline bigram model, and p_ml the maximum-likelihood bigram model
• Optimize λ using 10-way cross-validation on the training set
• Evaluate by comparing the perplexity of p_s(·|t) on a new test set with that of the baseline model p_gt(·|t)
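A minimal sketch of the interpolation and the λ search; the distributions are plain dictionaries over following words, and the 1e-10 floor for unseen words and the single held-out split (instead of the talk's 10-way cross-validation) are simplifications:

```python
import math

def interpolate(p_target, p_proxy, lam):
    """p_s(.|t) = (1 - lam) * p_gt(.|t) + lam * p_ml(.|s)."""
    vocab = set(p_target) | set(p_proxy)
    return {w: (1 - lam) * p_target.get(w, 0.0) + lam * p_proxy.get(w, 0.0)
            for w in vocab}

def perplexity(model, followers):
    """Perplexity of a conditional model on the observed following words."""
    log_sum = sum(math.log(model.get(w, 1e-10)) for w in followers)
    return math.exp(-log_sum / len(followers))

def tune_lambda(p_target, p_proxy, held_out, grid=None):
    """Pick the interpolation weight that minimizes held-out perplexity."""
    grid = grid or [i / 20 for i in range(21)]
    return min(grid, key=lambda lam:
               perplexity(interpolate(p_target, p_proxy, lam), held_out))
```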
Ranking Proxies
• Score each candidate proxy s for target word t:
  1. WordNet similarity score: wsim_max(t, s)
  2. KL divergence: D(p_gt(·|t) || p_ml(·|s)) (sketched below)
  3. Training-set perplexity reduction of word s, i.e. the improvement in perplexity of p_s(·|t) compared to the 10-way cross-validated model
  4. Random: choose a proxy randomly
• Choose the highest-ranked proxy (ignore the actual scales of the scores)
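A sketch of ranking criterion 2, the KL divergence between the target's bigram distribution and each proxy's maximum-likelihood distribution; the eps floor for zero probabilities is a stand-in for whatever smoothing was actually used:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)), with p and q as dicts."""
    return sum(pw * math.log(pw / q.get(w, eps))
               for w, pw in p.items() if pw > 0)

def rank_proxies(p_target, candidates):
    """Rank candidate proxies s by D(p_gt(.|t) || p_ml(.|s)), best first;
    candidates maps each proxy word to its distribution p_ml(.|s)."""
    return sorted(candidates, key=lambda s: kl_divergence(p_target, candidates[s]))
```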
Experiments
• 140MW of Broadcast News
  – Test: 40MW reserved for testing
  – Train: 9 random subsets of the training data (1MW - 100MW)
• From nouns occurring in WordNet:
  – 150 target words (occurred < 2 times in 1MW)
  – 2000 candidate proxies (occurred > 50 times in 1MW)
Methodology
For each size of training corpus:
• Find the highest-scoring proxy for each target word under each ranking method
  – e.g. target word ASPIRATIONS; best proxies: SKILLS, DREAMS, DREAM/DREAMS, HILL
• Create the interpolated models and calculate the perplexity reduction on the test set
• Average perplexity reduction: weighted average of the perplexity reduction achieved for each target word, weighted by the frequency of each target word in the test set (see the sketch below)
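A small sketch of the evaluation average described in the last bullet; the argument names are hypothetical:

```python
def avg_pp_reduction(reductions, test_counts):
    """Average perplexity reduction over target words, weighted by each
    target word's frequency in the test set.
    reductions: word -> percent PP reduction; test_counts: word -> frequency."""
    total = sum(test_counts[w] for w in reductions)
    return sum(reductions[w] * test_counts[w] for w in reductions) / total
```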
[Figure 1: Perplexity reduction as a function of training data size for four similarity measures (WordNet, Random, KLdiv, TrainPP); x-axis: data size in millions of words (1-100MW), y-axis: percent PP reduction.]
[Figure 2: Perplexity reduction as a function of proxy rank for four similarity measures (random, WNsim avg, KLdiv, cvPP); x-axis: proxy rank, y-axis: percent PP reduction.]
Error Analysis

  %    Type of relation         Examples
  45   Not an IS-A relation     rug-arm, glove-scene
  40   Missing or weak in WN    aluminum-steel, bomb-shell
  15   Present in WN            blizzard-storm

Table 1: Classification of best proxies for 150 target words.
• Each target word ⇒ proxy with the largest test-set PP reduction ⇒ categorized relation
• Also a few topical relations (TESTAMENT-RELIGION) and domain-specific relations (BEARD-MAN)
Modelling Semantic Coherence
• N-grams only model short distances
• In real sentences, content words come from the same semantic domain
• Want to find long-distance correlations
• Incorporate a semantic similarity constraint into an exponential LM
Modelling Semantic Coherence II
• Find words that co-occur within a sentence
• Association statistics from data are only reliable for high-frequency words
• Long-distance associations are semantic
• Use WordNet?
Experiments
• “Cheating experiment” to evaluate the usefulness of WordNet
• Derive similarities from WordNet for frequent words only
• Compare to a measure of association calculated from large amounts of data (the ground truth)
• Question: are these two measures correlated? (see the sketch below)
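The slides do not say which correlation statistic was used; a rank correlation such as Spearman's rho is one plausible way to pose the question (sketch only, with hypothetical score lists):

```python
from scipy.stats import spearmanr

def measure_correlation(wn_scores, q_scores):
    """Rank correlation between WordNet similarity scores and Q-statistic
    association scores for the same list of noun pairs."""
    rho, p_value = spearmanr(wn_scores, q_scores)
    return rho, p_value
```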
“Ground Truth”
• 500,000 noun pairs
• Expected number of chance co-occurrences > 5
• Word-pair association: Yule's Q statistic, computed from the 2x2 co-occurrence table (sketched below):

                 Word 1 yes   Word 1 no
  Word 2 yes        C11          C12
  Word 2 no         C21          C22

  Q = (C11 · C22 − C12 · C21) / (C11 · C22 + C12 · C21)

• Q ranges from -1 to 1
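A minimal sketch of the Q statistic from the 2x2 sentence co-occurrence counts; the example counts are made up:

```python
def yules_q(c11, c12, c21, c22):
    """Yule's Q: c11 = sentences containing both words, c12/c21 = exactly
    one of the two, c22 = neither.  Ranges from -1 to +1."""
    num = c11 * c22 - c12 * c21
    den = c11 * c22 + c12 * c21
    return num / den if den else 0.0

# Hypothetical counts for a pair that co-occurs far more often than chance:
print(yules_q(c11=40, c12=60, c21=80, c22=9820))  # ~0.98
```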
[Figure 3: Looking for correlation: WordNet similarity scores versus Q scores for 10,000 noun pairs.]
[Figure: Density of Q scores for all noun pairs versus pairs with WordNet similarity wsim > 6; x-axis: Q score, y-axis: density.]
Only 0.1% of word pairs have WordNet similarity scores above 5, and only 0.03% are above 6.
[Figure 4: Comparing the effectiveness of two WordNet word similarity measures (weighted and maximum): precision versus recall.]
  Relation type       Num        Examples
  WN                  277 (163)
    part/member        87 (15)   finger-hand, student-school
    phrase isa         65 (47)   death tax IS-A tax
    coordinates        41 (31)   house-senate, gas-oil
    morphology         30 (28)   hospital-hospitals
    isa                28 (23)   gun-weapon, cancer-disease
    antonyms           18 (13)   majority-minority
    reciprocal          8 (6)    actor-director, doctor-patient
  non-WN              461
    topical            336       evidence-guilt, church-saint
    news and events    102       iraq-weapons, glove-theory
    other               23       END of the SPECTRUM

Table 2: Error analysis
Conclusions?
• Very small bigram PP improvement when little data is available
• Words with very high WN similarity do tend to co-occur within sentences
• However, recall is poor because most relations are topical (though WN is adding topical links)
• Limited types and quantities of relationships in WordNet compared to the spectrum of relationships found in real data
• WN word similarities are a weak source of knowledge for these two tasks