Natural Language Processing and Machine Learning: Synergy or Discord? A Case Study with MT, IR and Sentiment
FIRE 2016
Pushpak Bhattacharyya, IIT Patna and IIT Bombay, pb@cse.iitb.ac.in
9th Dec 2016
Need for NLP
• Huge amount of language data in electronic form
• Unstructured data (like free-flowing text) will grow to 40 zettabytes (1 zettabyte = 10²¹ bytes) by 2020
• How to make sense of this huge data?
• Example 1: e-commerce companies need to know the sentiment of online users, sifting through 1 lakh (100,000) e-opinions per week: needs NLP
• Example 2: the translation industry is projected to grow into a $37 billion business by 2020
Nature of Machine Learning
• Automatically learning rules and concepts from data
• Example: learning the concept of table. What is "tableness"?
• Rule: a flat surface with 4 legs (approximate: to be refined gradually)
Why NLP and ML?
• Impossible for humans (single or a team) to make sense of and analyse humongous text data
• Many processing steps in NLP
• Impossible to give correct-consistent-complete rules covering each and every situation
• Example rule: adjectives precede nouns ("blue sky"), but not in French! ("ciel bleu")
NLP: layered, multidimensional
[Figure: the NLP Trinity, three axes of increasing complexity of processing]
• Problem: morphology, POS tagging, chunking, parsing, semantics, discourse and coreference
• Language: Hindi, Marathi, English, French
• Algorithm: HMM, MEMM, CRF
NLP = Ambiguity Processing
• Lexical ambiguity
• Structural ambiguity
• Semantic ambiguity
• Pragmatic ambiguity
Examples
1. (Ellipsis) Amsterdam airport: "Baby Changing Room"
2. (Attachment/grouping) "Public demand changes" (credit for the phrase: Jayant Haritsa):
   (a) Public demand changes, but does anybody listen to them?
   (b) Public demand changes, and we companies have to adapt to such changes.
   (c) Public demand changes have pushed many companies out of business.
3. (Pragmatics-1) The use of the shin bone is to locate furniture in a dark room.
New words and terms (people are very creative!!)
1. ROFL: rolling on the floor laughing; LOL: laugh out loud
2. facebook: to use Facebook; google: to search
3. communifake: faking to talk on a mobile; Obamacare: medical care system introduced through the mediation of President Obama (portmanteau words)
4. After BREXIT (UK's exit from the EU), in Mumbai Mirror and on Twitter: We got Brexit. What's next? Grexit. Departugal. Italeave. Fruckoff. Czechout. Oustria. Finish. Slovakout. Latervia. Byegium
Inter-layer interaction
Text-1: "I saw the boy with a telescope which he dropped accidentally"
Text-2: "I saw the boy with a telescope which I dropped accidentally"
The two dependency parses are identical except for the subject of "dropped"; deciding between "he" and "I" pulls in the discourse/coreference layer that sits above parsing in the stack (morphology, POS tagging, chunking, parsing, semantics, discourse and coreference).
Parse of Text-1:
nsubj(saw-2, I-1), root(ROOT-0, saw-2), det(boy-4, the-3), dobj(saw-2, boy-4), det(telescope-7, a-6), prep_with(saw-2, telescope-7), dobj(dropped-10, telescope-7), nsubj(dropped-10, he-9), rcmod(telescope-7, dropped-10), advmod(dropped-10, accidentally-11)
Parse of Text-2: the same, except nsubj(dropped-10, I-9).
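A quick way to see such parses is to run an off-the-shelf dependency parser. Below is a minimal sketch using spaCy (an assumption; the slide shows Stanford-style relations, and spaCy's label set differs slightly, e.g. "prep"/"pobj" instead of the collapsed "prep_with"):

```python
# Minimal sketch: dependency triples for the two texts via spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
for text in ["I saw the boy with a telescope which he dropped accidentally",
             "I saw the boy with a telescope which I dropped accidentally"]:
    print(text)
    for tok in nlp(text):
        # relation(head-position, dependent-position), 1-indexed as in the slide
        print(f"  {tok.dep_}({tok.head.text}-{tok.head.i + 1}, {tok.text}-{tok.i + 1})")
```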
NLP: dealing with multilinguality
[Figure: language typology]
Rules: when and when not
• When the phenomenon is understood AND expressible, rules are the way to go
• "Do not learn when you know!!"
• When the phenomenon "seems arbitrary" at the current state of knowledge, DATA is the only handle!
  – Why do we say "many thanks" and not "several thanks"? Impossible to give a rule.
• Rely on machine learning to tease the truth out of data; expectations are not always met
Impact of probability: language modeling
Probabilities are computed in the context of corpora.
1. P("The sun rises in the east")
2. P("The sun rise in the east"): less probable because of the grammatical mistake
3. P("The svn rises in the east"): less probable because of the lexical mistake
4. P("The sun rises in the west"): less probable because of the semantic mistake
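As a concrete illustration, a toy bigram language model makes the same ranking fall out of corpus counts. The tiny corpus below is illustrative only (not from the talk); any add-one-smoothed n-gram model behaves similarly:

```python
# Toy bigram language model sketch: sentences with grammatical,
# lexical, or semantic mistakes get lower probability simply because
# their word sequences are rare in the training corpus.
from collections import Counter

corpus = ["the sun rises in the east", "the sun sets in the west",
          "the sun rises early", "the moon rises in the east"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def prob(sentence, V=1000):
    """P(sentence) under an add-one-smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
    return p

print(prob("the sun rises in the east"))   # highest
print(prob("the sun rise in the east"))    # lower: grammatical mistake
```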
Power of Data
Automatic image labeling (Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, 2014)
Automatically captioned: "Two pizzas sitting on top of a stove top oven"
Automatic image labeling (contd.)
Main methodology
• Object A: extract parts and features
• Object B, which is in correspondence with A: extract parts and features
• LEARN mappings of these features and parts
• Use in NEW situations: called DECODING
Feature correspondence: "I am hungry now"
Linguistics-Computation Interaction
• Need to understand BOTH language phenomena and the data
• An annotation designer has to understand BOTH linguistics and statistics!
[Figure: the annotator stands between linguistics (language phenomena) on one side and data (statistical phenomena) on the other]
Case Study-1: Machine Translation (Good Linguistics + Good ML)
• Pushpak Bhattacharyya, Machine Translation, CRC Press, 2015.
• Raj Dabre, Fabien Cromieres, Sadao Kurohashi and Pushpak Bhattacharyya, Leveraging Small Multilingual Corpora for SMT Using Many Pivot Languages, NAACL 2015, Denver, Colorado, USA, May 31 - June 5, 2015.
Kinds of MT Systems (point of entry from source to the target text) (Vauquois, 1968)
Simplified Vauquois
RBMT-EBMT-SMT spectrum: from knowledge (rules) intensive to data (learning) intensive
RBMT → EBMT → SMT
Illustration of the difference between RBMT, EBMT and SMT
• Peter has a house
• Peter has a brother
• This hotel has a museum
The tricky case of 'have' translation
English                  | Marathi
Peter has a house        | पीटरकडे एक घर आहे / piitar kade ek ghar aahe
Peter has a brother      | पीटरला एक भाऊ आहे / piitar laa ek bhaauu aahe
This hotel has a museum  | ह्या हॉटेलमध्ये एक संग्रहालय आहे / hyaa hotel madhye ek saMgrahaalay aahe
RBMT
• If the syntactic subject is animate AND the syntactic object is owned by the subject, then "have" should translate to "kade ... aahe".
• If the syntactic subject is animate AND the syntactic object denotes kinship with the subject, then "have" should translate to "laa ... aahe".
• If the syntactic subject is inanimate, then "have" should translate to "madhye ... aahe".
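As a sketch, the three rules can be written down directly as code. The boolean features (animate, owned_by_subject, kinship_with_subject) are hypothetical simplifications of what a real RBMT system would compute from a lexicon and a parse:

```python
# Sketch of the three 'have' rules as executable code. The feature
# dictionaries are assumptions standing in for lexicon/parser output.
def translate_have(subject, obj):
    """Pick the Marathi rendering of English 'have'."""
    if not subject.get("animate"):
        return "madhye ... aahe"       # inanimate subject: location/containment
    if obj.get("kinship_with_subject"):
        return "laa ... aahe"          # kinship relation
    return "kade ... aahe"             # ownership (default for animate subjects)

print(translate_have({"animate": True}, {"owned_by_subject": True}))    # kade ... aahe
print(translate_have({"animate": True}, {"kinship_with_subject": True}))  # laa ... aahe
print(translate_have({"animate": False}, {}))                           # madhye ... aahe
```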
EBMT
X have Y → X_kade Y aahe / X_laa Y aahe / X_madhye Y aahe
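A minimal sketch of applying such a stored example template to a new input; note that choosing among the three target templates (kade/laa/madhye) is precisely what the bare template leaves unresolved:

```python
# Sketch of EBMT-style template matching: the stored example
# "X have Y -> X_kade Y aahe" is applied by substituting variables.
# The regex and output format are illustrative assumptions.
import re

def apply_template(sentence):
    m = re.match(r"(?P<X>\w+) (?:has|have) (?:an? )?(?P<Y>.+)", sentence)
    if m:
        return f"{m.group('X')}_kade {m.group('Y')} aahe"
    return None

print(apply_template("Peter has a house"))  # Peter_kade house aahe
```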
SMT: learned phrase pairs, shown with a word-by-word gloss (<cm> = case marker)
• has a house → kade ek ghar aahe (gloss: <cm> one house has)
• has a car → kade ek gaadii aahe (gloss: <cm> one car has)
• has a brother → laa ek bhaau aahe (gloss: <cm> one brother has)
• has a sister → laa ek bahiin aahe (gloss: <cm> one sister has)
• hotel has → hotel madhye aahe (gloss: hotel <cm> has)
• hospital has → haspital madhye aahe (gloss: hospital <cm> has)
SMT: new sentence "This hospital has 100 beds"
n-grams (n = 1, 2, 3, 4, 5) like the following will be formed:
• "This", "hospital", … (unigrams)
• "This hospital", "hospital has", "has 100", … (bigrams)
• "This hospital has", "hospital has 100", … (trigrams)
These n-grams are matched against the learned phrase pairs: DECODING!!!
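The sketch below illustrates this matching step with a toy phrase table built from the pairs above; a real decoder would also score candidates with a language model and reordering costs:

```python
# Sketch of the decoding lookup: segment the new sentence into
# n-grams and match them against the learned phrase pairs.
phrase_table = {
    "has a house": "kade ek ghar aahe",
    "has a brother": "laa ek bhaau aahe",
    "hospital has": "haspital madhye aahe",
}

def ngrams(words, n):
    """All contiguous n-word spans of the sentence."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "This hospital has 100 beds".split()
for n in range(1, 6):
    for phrase in ngrams(sentence, n):
        if phrase in phrase_table:
            print(f"match: '{phrase}' -> '{phrase_table[phrase]}'")
# match: 'hospital has' -> 'haspital madhye aahe'
```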
Foundation of SMT
• Data-driven approach
• Goal: find the English sentence e, given the foreign-language sentence f, for which p(e|f) is maximum
• Translations are generated on the basis of a statistical model
• Parameters are estimated from bilingual parallel corpora
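Formally, this is the noisy-channel formulation of Brown et al.: applying Bayes' rule splits the objective into a translation model p(f|e) and a language model p(e), since p(f) is constant for a given input:

$$
\hat{e} \;=\; \operatorname*{argmax}_{e}\, p(e \mid f)
\;=\; \operatorname*{argmax}_{e}\, \frac{p(f \mid e)\,p(e)}{p(f)}
\;=\; \operatorname*{argmax}_{e}\, p(f \mid e)\,p(e)
$$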
The all-important word alignment
• The edifice on which the structure of SMT is built (Brown et al., 1990, 1993; Och and Ney, 2003)
• Word alignment → phrase alignment (Koehn et al., 2003)
• Word alignment → tree alignment (Chiang 2005, 2007; Koehn 2010)
• Alignment is at the heart of factor-based SMT too (Koehn and Hoang, 2007)
Word alignment as the crux of Statistical Machine Translation
English                      | French
(1) three rabbits (a b)      | (1) trois lapins (w x)
(2) rabbits of Grenoble (b c d) | (2) lapins de Grenoble (x y z)
(The letters a-d label the English words and w-z the French words in the tables below.)
Initial probabilities: each cell denotes t(w|a), t(x|a), etc. (uniform start)
      a     b     c     d
w    1/4   1/4   1/4   1/4
x    1/4   1/4   1/4   1/4
y    1/4   1/4   1/4   1/4
z    1/4   1/4   1/4   1/4
"Counts" (expected counts from each sentence pair)
From pair (1) three rabbits / trois lapins:
      a     b     c     d
w    1/2   1/2    0     0
x    1/2   1/2    0     0
y     0     0     0     0
z     0     0     0     0
From pair (2) rabbits of Grenoble / lapins de Grenoble:
      a     b     c     d
w     0     0     0     0
x     0    1/3   1/3   1/3
y     0    1/3   1/3   1/3
z     0    1/3   1/3   1/3
Revised probabilities table (counts summed over the two pairs, renormalized so each column sums to 1)
      a     b     c     d
w    1/2   1/4    0     0
x    1/2   5/12  1/3   1/3
y     0    1/6   1/3   1/3
z     0    1/6   1/3   1/3
"Revised counts" (second round of expected counts, using the revised probabilities)
From pair (1) three rabbits / trois lapins:
      a     b     c     d
w    1/2   3/8    0     0
x    1/2   5/8    0     0
y     0     0     0     0
z     0     0     0     0
From pair (2) rabbits of Grenoble / lapins de Grenoble:
      a     b     c     d
w     0     0     0     0
x     0    5/9   1/3   1/3
y     0    2/9   1/3   1/3
z     0    2/9   1/3   1/3
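The whole EM loop fits in a few lines. The sketch below follows the slides' convention (each English word distributes one unit of expected count over the French words of its sentence, and t(f|e) is renormalized per English word); its first count step reproduces the "counts" tables, the normalization step reproduces the revised probabilities, and the second count step reproduces the revised counts above:

```python
# Minimal EM sketch for word alignment over the two sentence pairs.
# English: a="three", b="rabbits", c="of", d="Grenoble"
# French:  w="trois", x="lapins",  y="de", z="Grenoble"
from collections import defaultdict

pairs = [(["a", "b"], ["w", "x"]),            # three rabbits / trois lapins
         (["b", "c", "d"], ["x", "y", "z"])]  # rabbits of Grenoble / lapins de Grenoble

# uniform initialisation: t(f|e) = 1/4 for every cell of the table
t = {(f, e): 0.25 for e in "abcd" for f in "wxyz"}

for iteration in range(10):
    counts = defaultdict(float)
    # E-step: each English word e spreads one unit of count over the
    # French words of its sentence, in proportion to t(f|e)
    for es, fs in pairs:
        for e in es:
            z = sum(t[(f, e)] for f in fs)
            for f in fs:
                counts[(f, e)] += t[(f, e)] / z
    # M-step: renormalise so that sum over f of t(f|e) = 1
    for e in "abcd":
        total = sum(counts[(f, e)] for f in "wxyz")
        for f in "wxyz":
            t[(f, e)] = counts[(f, e)] / total

print(round(t[("x", "b")], 3))  # approaches 1: "lapins" aligns with "rabbits"
```

The key observation: since b ("rabbits") co-occurs with x ("lapins") in both pairs, the iterations steadily concentrate t(·|b) on x, which is exactly the alignment a human would draw.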