Exploring Multi-level Distributional Semantics for Cross-lingual Entity Discovery and Linking
Boliang Zhang, Xiaoman Pan, Lifu Huang, Ying Lin, Heng Ji
jih@rpi.edu
Noisy Training Data Acquisition 1: Chinese Room
Noisy Training Data Acquisition 2: Wikipedia Mining
§ Generate "silver-standard" training data automatically
§ Apply self-training to make the training data more complete and consistent
Exploit Non-traditional Universal Linguistic Resources
• Grammar books from Lori Levin's bookshelf and CIA Names from the DARPA PM's bookshelf
• Unicode Common Locale Data Repository, Wiktionary, PanLex, Multilingual WordNet, GeoNames, JRC-Names, phrase pairs mined from Wikipedia
• Phrase books from Language Survival Kits and the Elicitation Corpus
• Resources largely ignored by the NLP community
Linguistic Structure from WALS and SSWL
• Linguistic structure from the WALS database and Syntactic Structures of the World's Languages (SSWL)
• Universal morphology analyzer based on Wikipedia markup
  o Kıta Fransası, güneyde [[Akdeniz]] den kuzeyde [[Manş Denizi]] ve [[Kuzey Denizi]] ne, doğuda [[Ren Nehri]] nden batıda [[Atlas Okyanusu]] na kadar yayılan topraklarda yer alır. (Continental France lies on territory stretching from the [[Mediterranean Sea]] in the south to the [[English Channel]] and the [[North Sea]] in the north, and from the [[Rhine River]] in the east to the [[Atlantic Ocean]] in the west.)
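The Turkish example above shows how detached case suffixes surface right after anchor links in Wikipedia markup. A minimal sketch of mining such (entity, suffix) pairs, assuming a simple regex over raw wikitext (the pattern and length cutoff are illustrative, not the actual analyzer):

```python
import re

def mine_suffixes(wiki_text):
    """Collect (entity, following-token) pairs where a short token directly
    follows an anchor link -- in Turkish Wikipedia these tokens are often
    detached case suffixes, e.g. "[[Akdeniz]] den" -> ("Akdeniz", "den")."""
    pairs = []
    # [[Title]] or [[Title|surface]] followed by whitespace and a short token
    for m in re.finditer(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s+(\w{1,4})\b", wiki_text):
        pairs.append((m.group(1), m.group(2)))
    return pairs

sentence = ("Kıta Fransası, güneyde [[Akdeniz]] den kuzeyde [[Manş Denizi]] ve "
            "[[Kuzey Denizi]] ne , doğuda [[Ren Nehri]] nden batıda "
            "[[Atlas Okyanusu]] na kadar yayılan topraklarda yer alır.")
print(mine_suffixes(sentence))
```

A real analyzer would additionally filter out function words such as "ve" (and) that the naive pattern also captures.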
Character-Aware Word Embeddings
• Motivation: mentions of the same concept across languages may share a set of similar characters, e.g., Semsettin Gunaltay (English) = Şemsettin Günaltay (Turkish) = Semsetin Ganoltey (Somali)
• Compose word embeddings from shared character embeddings using Convolutional Neural Networks
• Further optimized by a language model based on Recurrent Neural Networks
  § maximize the prediction of the current word based on previous words
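The composition step above can be sketched in plain numpy: embed each character, slide fixed-width convolution filters over the character sequence, and max-pool into a word vector. Dimensions, filter width, and the random initialization are all illustrative assumptions, not the system's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
CHARS = sorted(set("şemsettin günaltay semsetin ganoltey"))
char_emb = {c: rng.normal(size=16) for c in CHARS}   # 16-dim char embeddings (toy)
filters = rng.normal(size=(32, 3 * 16))              # 32 filters over width-3 windows

def word_embedding(word):
    """Compose a word embedding from its characters: convolve width-3
    character windows with the filters, then max-pool over positions."""
    mat = np.stack([char_emb[c] for c in word.lower()])    # (len, 16)
    windows = [mat[i:i + 3].ravel() for i in range(len(word) - 2)]
    conv = np.stack(windows) @ filters.T                   # (len-2, 32)
    return conv.max(axis=0)                                # max-pool -> (32,)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Transliterated variants share most characters, so their embeddings correlate
print(cosine(word_embedding("şemsettin"), word_embedding("semsetin")))
```

Because character embeddings are shared across languages, spelling variants of the same name end up close in the composed space even with no bilingual dictionary.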
Feed Non-traditional Linguistic Resources into DNN
[Architecture diagram: a bidirectional LSTM-CRF name tagger. Character embeddings are composed by a CNN and concatenated with word embeddings and linguistic feature embeddings; left and right LSTM layers feed a hidden layer and a CRF layer that outputs B/I/O tags. Linguistic features: patterns for English and the low-resource language, low-resource-language-to-English lexicons, gazetteers, and low-resource language grammar rules.]
Common Semantic Space Construction
Construct a Common Semantic Space for Thousands of Languages
§ Motivations
§ There are 3,000+ languages with electronic records
§ NLP training data is only available for a few dominant languages
§ Goals
§ Build a common semantic space across thousands of languages for resource sharing, and richer continuous semantic representations for words, concepts and entities
§ Limitations of Previous Attempts (e.g., Upadhyay et al., 2016; Cho et al., 2017)
§ Mostly English-anchored, so they cannot capture all linguistic phenomena
§ Heavily reliant on bilingual dictionaries and parallel data, which are not always available
§ Limited to dozens of languages
Multi-Level Multi-lingual Alignment
• When bilingual word dictionaries are not available, back off to shared linguistic structures
  § e.g., apposition, conjunction, plural suffixes (English -s / -es, Turkish -lar / -ler, Somali -o)
  § Generalized from language-universal resources such as the WALS database and Syntactic Structures of the World's Languages
  § Classify languages according to a large number of typological properties (phonological, lexical, grammatical)
  § 2,676 languages, 58,000+ (language, feature, feature value) tuples, e.g., (English, canonical word order, SVO)
• Project monolingual word embeddings into a common semantic space, and align both representations of words and linguistic structures in the common space
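The back-off idea above can be illustrated with a few WALS-style tuples: languages that share a typological feature value can share linguistic-structure representations during alignment. The tuples below are a tiny hand-picked sample, not the real database of 58,000+ entries:

```python
# A few (language, feature, feature value) tuples in the WALS style
WALS = [
    ("English", "canonical word order", "SVO"),
    ("Somali",  "canonical word order", "SOV"),
    ("Turkish", "canonical word order", "SOV"),
    ("Turkish", "plural marking", "suffix"),
    ("English", "plural marking", "suffix"),
]

def languages_sharing(feature, value):
    """Return the languages sharing one typological feature value --
    a back-off signal when no bilingual dictionary is available."""
    return sorted(lang for lang, f, v in WALS if f == feature and v == value)

print(languages_sharing("canonical word order", "SOV"))  # ['Somali', 'Turkish']
```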
Model Training
o Language model prediction loss
o Multilingual alignment loss
o Overall loss
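The slide lists the loss components without their formulas. One common way to combine such terms, with an interpolation weight λ that is an assumption here rather than the system's stated hyperparameter, is a weighted sum:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda \, \mathcal{L}_{\mathrm{align}}
```

where \(\mathcal{L}_{\mathrm{LM}}\) is the language model prediction loss and \(\mathcal{L}_{\mathrm{align}}\) is the multilingual alignment loss.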
Linguistic Features Matter: More Robust to Noise
[Figure: Uzbek name tagging results (Zhang et al., 2017)]
Impact of Character-Aware Word Embeddings
§ Name Tagging F-score (%)

Models   Chinese  English  Spanish
Before   64.1     67.4     64.6
After    68.0     70.9     68.9
Impact of Common Semantic Space
• Chechen Name Tagging

Models                                   P (%)  R (%)  F (%)
Randomly initialized                     46.3   45.3   45.8
Pre-trained                              54.8   41.3   47.1
+ Common semantic space word embeddings  62.1   50.1   55.4
Something Old: Hierarchical Brown Clustering

Language        w/o BC (%)  with BC (%)
Albanian        72.4        74.6
Chechen         53.1        55.4
Chinese         66.3        68.0
English         69.5        70.9
Kannada         51.9        56.0
Kikuyu          84.2        88.7
Nepali          41.6        43.9
Northern Sotho  90.2        90.8
Polish          49.6        53.2
Somali          76.9        78.5
Spanish         67.1        68.9
Swahili         64.3        67.8
Yoruba          46.1        49.5
Joint Learning of Word and Entity Embeddings from Wikipedia
• Considering all Wikipedia anchor links as entity annotations, a training corpus can be created by replacing anchor links with unique entity IDs, e.g.:
  [[en/Apple|apple]] is a fruit → apple is a fruit / en/Apple is a fruit
  [[en/Apple_Inc.|apple]] is a company → apple is a company / en/Apple_Inc. is a company
• Multi-lingual
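The corpus derivation step can be sketched with a small regex: from each anchored sentence, emit one surface-word version and one entity-ID version. This is a minimal sketch of the preprocessing, not the full pipeline:

```python
import re

# Matches [[entity_id|surface]] anchors as in the slide's examples
ANCHOR = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")

def expand_anchors(sentence):
    """Derive the two training variants of an anchored Wikipedia sentence:
    one with surface words, one with unique entity IDs."""
    word_version = ANCHOR.sub(lambda m: m.group(2), sentence)
    entity_version = ANCHOR.sub(lambda m: m.group(1), sentence)
    return word_version, entity_version

print(expand_anchors("[[en/Apple|apple]] is a fruit"))
# ('apple is a fruit', 'en/Apple is a fruit')
print(expand_anchors("[[en/Apple_Inc.|apple]] is a company"))
# ('apple is a company', 'en/Apple_Inc. is a company')
```

Training embeddings on the concatenation of both versions places words and entity IDs in one shared vector space.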
Joint Learning of Word and Entity Embeddings from Wikipedia
[Architecture diagram connecting a knowledge space and a text space. Entity Representation Learning models each knowledge-base entity e_j by predicting its neighbors N(e_j) through inlinks, outlinks and categories, i.e., P(N(e_j) | e_j), over entities such as Independence Day (US), Independence Day (film) and Memorial Day. Mention Representation Learning maps anchor mentions such as [[Independence Day (US)|July 4th]] to entity senses s*_j given their context, i.e., P(e_j | C(m_l), s*_j). Text Representation Learning models words and mention senses over Wikipedia text, i.e., P(C(w_i) | w_i) · P(C(m_l) | s*_j).]
Learning Entity Embeddings from DBpedia
• Construct a weighted undirected graph G = (E, D) from DBpedia, where E is the set of all entities in DBpedia and d_ij ∈ D indicates that entities e_i and e_j share some DBpedia properties. The weight w_ij of d_ij is computed as:

  w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|)

  where p_i, p_j are the sets of DBpedia properties of e_i and e_j respectively.
• Apply the graph embedding framework proposed by Tang et al. (2015) to generate knowledge representations for all entities
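The edge-weight formula is a direct set computation; the property names below are hypothetical placeholders, not actual DBpedia properties:

```python
def edge_weight(p_i, p_j):
    """w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|): shared DBpedia properties,
    normalized by the larger of the two property sets."""
    return len(p_i & p_j) / max(len(p_i), len(p_j))

# Hypothetical property sets for two fruit entities
p_apple = {"dbo:type", "dbo:family", "dbo:origin"}
p_pear  = {"dbo:type", "dbo:family", "dbo:genus", "dbo:season"}
print(edge_weight(p_apple, p_pear))  # 2 shared / max(3, 4) = 0.5
```

The max-normalization keeps weights in [0, 1] and penalizes pairs where one entity has many unshared properties.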
Impact of Joint Embeddings on Entity Linking
• Unsupervised entity linking based on salience, similarity and coherence
• Tested on EDL16 perfect English NAM mentions

Models                                          CEAFm P  CEAFm R  CEAFm F1
Baseline                                        0.762    0.843    0.801
+ Joint word and entity embeddings (Wikipedia)  0.791    0.875    0.831
+ Entity embeddings (DBpedia)                   0.812    0.897    0.852
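A minimal sketch of the unsupervised scoring idea: rank candidate entities by a combination of the three signals the slide names. The linear weights, toy vectors, and candidate scores are all illustrative assumptions, not the system's actual combination:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_score(salience, context_vec, entity_vec, coherence, w=(0.4, 0.4, 0.2)):
    """Combine salience, context-entity similarity, and coherence with
    illustrative linear weights."""
    return w[0] * salience + w[1] * cosine(context_vec, entity_vec) + w[2] * coherence

ctx = np.array([0.9, 0.1, 0.0])              # toy vector for a fruit-like context
fruit_entity = np.array([1.0, 0.0, 0.0])     # hypothetical en/Apple embedding
company_entity = np.array([0.0, 1.0, 0.1])   # hypothetical en/Apple_Inc. embedding
candidates = {"en/Apple": fruit_entity, "en/Apple_Inc.": company_entity}
best = max(candidates, key=lambda e: link_score(0.5, ctx, candidates[e], 0.5))
print(best)  # en/Apple
```

The joint embeddings from the previous slides are what make the similarity term meaningful: mention contexts and entities live in the same space.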
Resources and Demos
Systems, Data and Resources Publicly Available
§ Re-trainable systems:
§ http://blender02.cs.rpi.edu:3300/elisa_ie/api
§ Source code available to government users upon request
§ Tri-lingual EDL is being integrated into CoreNLP, with release expected in 2017
§ Data and resources:
§ http://nlp.cs.rpi.edu/wikiann/
§ Demos:
§ http://blender02.cs.rpi.edu:3300/elisa_ie
§ http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap
Demo 1: Cross-lingual Entity Discovery and Linking for 282 Languages
§ http://blender02.cs.rpi.edu:3300/elisa_ie
§ http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap
IE Application Example: Disaster Relief