Introduction to CL Session 1: 7/08/2011
What is computational linguistics?
• Processing natural language text by computers for practical applications ... or linguistic research
• Among practical applications:
  – Sometimes the computer only needs to classify or transform the text ... but sometimes it needs to “understand”
  – Ex: Watson, winner of ‘Jeopardy’
• CL vs. NLP (natural language processing)
NLP applications
• Automatic speech recognition (ASR): speech → text
• Machine translation (MT): L1 → L2
• Information retrieval (IR): query + documents → a subset of the documents
• Information extraction (IE): document → “database”
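The IR mapping above (query + documents → a subset of the documents) can be sketched as a simple word-overlap filter. This is only a minimal illustration, not how real retrieval systems rank results; the function names `tokenize` and `retrieve` and the toy documents are assumptions invented for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents):
    """Return the documents that share at least one word with the query,
    ordered by how many distinct query words they contain (most first)."""
    query_words = tokenize(query)
    scored = []
    for doc in documents:
        overlap = len(query_words & tokenize(doc))
        if overlap > 0:
            scored.append((overlap, doc))
    # The sort is stable, so equally scored documents keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

# Toy collection, invented for illustration.
docs = [
    "He visited New York in 2003.",
    "Machine translation maps one language to another.",
    "New York is a city.",
]
hits = retrieve("New York translation", docs)
```

Here `hits` contains all three documents, with the two that mention both "New" and "York" ranked ahead of the one that only matches "translation". A query with no matching words returns an empty list.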
NLP applications (cont)
• Question answering (QA): question + documents → answer
• Summarization: documents → summary
• Natural language generation (NLG): representation → text
Other applications
• Call center
• Spam filter
• Spell checker
• Sentiment analysis: product reviews
• Bio-NLP: processing clinical data
• ….
Basic NLP tasks: Shallow processing
• Tokenization:
  – He visited New York in 2003.
• Morphological analysis:
  – visited → visit + -ed
• Part-of-speech tagging:
  – He/Pron visited/V New/?? York/N in/Prep 2003/CD
• Named-entity tagging:
  – He visited [LOCATION New York] in [YEAR 2003]
• Chunking:
  – [NP He] [V visited] [NP New York] in [NP 2003]
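The first two tasks above, tokenization and morphological analysis, can be sketched with a regular expression and a suffix rule. This is a deliberately naive illustration; the function names `tokenize` and `analyze_past_tense` are hypothetical, and the suffix rule only covers regular forms like the slide's "visited".

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens.
    "New" and "York" come out as separate tokens; grouping them is
    left to named-entity tagging or chunking."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def analyze_past_tense(word):
    """Naive morphological analysis: strip a final "-ed".
    Assumption: only regular forms such as "visited" are handled; words
    like "loved" or "speed" would be mis-analyzed by this rule."""
    if word.endswith("ed") and len(word) > 4:
        return (word[:-2], "-ed")
    return (word, None)

tokens = tokenize("He visited New York in 2003.")
analysis = analyze_past_tense("visited")
```

With the slide's sentence, `tokens` holds seven items (the six words plus the final period) and `analysis` is the pair `("visit", "-ed")`.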
Basic NLP tasks: Deep processing
• Parsing
  – (S (NP (PRON he)) (VP (V visited) ….)
• Semantic analysis
  – Semantic tagging: [AGENT He] visited [DEST New York] ….
  – Meaning: visit (he, New-York)
• Discourse
  – Coreference: “He” refers to “John”
  – Discourse structure
• Dialogue
• Generation
Ambiguity
• Phonological ambiguity: (ASR)
  – “too”, “two”, “to”
  – “ice cream” vs. “I scream”
  – “ta” in Mandarin: he, she, or it
• Morphological ambiguity: (morphological analysis)
  – unlockable: [[un-lock]-able] vs. [un-[lock-able]]
• Syntactic ambiguity: (parsing)
  – John saw a man with a telescope.
  – Time flies like an arrow.
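The syntactic ambiguity of "John saw a man with a telescope" can be made concrete by counting parse trees. The sketch below uses a CKY-style chart over a tiny grammar in Chomsky normal form; the grammar, the lexicon, and the function name `count_parses` are assumptions written for this one sentence, not a general English grammar.

```python
from collections import defaultdict

# Toy grammar (parent, left child, right child), invented for illustration.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "John": ["NP"], "saw": ["V"], "a": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees rooted in `root` that cover all words."""
    n = len(words)
    # chart[(i, j)][label] = number of trees with that label over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for label in LEXICON.get(word, []):
            chart[(i, i + 1)][label] += 1
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(root, 0)

ambiguous = count_parses("John saw a man with a telescope".split())
unambiguous = count_parses("John saw a man".split())
```

Under this toy grammar the full sentence receives two parses, one for each attachment of "with a telescope", while the shorter sentence receives one.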
Ambiguity (cont)
• Lexical ambiguity: (WSD)
  – Ex: “bank”, “saw”, “run”
• Semantic ambiguity: (semantic representation)
  – Ex: every boy loves his mother
  – Ex: John and Mary bought a house
• Discourse ambiguity:
  – Susan called Mary. She was sick. (coreference resolution)
  – It is pretty hot here. (intention resolution)
• Machine translation:
  – “brother”, “cousin”, “uncle”, etc.
Ambiguity resolution
• Rule-based or knowledge-based:
  – Parsing:
    • I saw a man with a hat
    • I saw a man with a telescope (in my hand)
  – WSD:
    • “bank”
  – MT:
    • “brother”, “cousin”, “uncle”
• Statistical approach:
  – Requires training data
  – Builds a statistical model
  – Knowledge and rules can be incorporated into the model as features, etc.
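The statistical approach (training data, then a statistical model) can be sketched for the WSD example “bank” with a small Naive Bayes classifier. Everything below is an assumption made for illustration: the training examples are invented, the sense labels `FINANCE` and `RIVER` are hypothetical, and add-one smoothing is just one simple choice.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration:
# (context words around "bank", sense label).
TRAINING = [
    (["deposit", "money", "account"], "FINANCE"),
    (["loan", "money", "interest"], "FINANCE"),
    (["river", "water", "fishing"], "RIVER"),
    (["muddy", "river", "shore"], "RIVER"),
]

def train(examples):
    """Collect sense counts, per-sense word counts, and the vocabulary."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocabulary.update(words)
    return sense_counts, word_counts, vocabulary

def classify(context, model):
    """Return the sense with the highest Naive Bayes log score."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense in sorted(sense_counts):
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocabulary)
        for word in context:
            if word in vocabulary:  # skip words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train(TRAINING)
finance_guess = classify(["money", "loan"], model)
river_guess = classify(["river", "water"], model)
```

Each context word acts as a feature, which is one place where knowledge and rules could be added to the model, for example by including part-of-speech or collocation features alongside the raw words.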
Major approaches to NLP
• Rule-based approach
• Statistical approach
  – Supervised learning
  – Semi-supervised learning
  – Unsupervised learning
Supervised learning algorithms
• Hidden Markov Model (HMM)
• Decision tree
• Decision list
• Naïve Bayes
• Transformation-based Learning (TBL)
• Maximum Entropy (MaxEnt)
• Support Vector Machine (SVM)
• Conditional Random Field (CRF)
• …
Data
• Raw text:
  – Monolingual: English/Chinese/Arabic Gigawords
  – Parallel data: UN data, EuroParl
• Treebank:
  – Syntactic treebanks: a set of parse trees
  – Proposition Bank:
  – Discourse Treebank
• Dictionaries
• WordNet
• FrameNet
• …
[Figure: layered diagram. Applications at the top; below them tasks (Task1, Task2, …, Task_i); then ML1, ML2, …, ML_m; and at the bottom D1, D2, …, D_n.]
The role of linguistic knowledge in NLP
• Suppose an NLP system is language-independent.
• Good or bad?
  – Good: it can be ported to many languages without any changes.
  – Bad: it cannot take advantage of properties of certain languages.
• How to incorporate (linguistic) knowledge in statistical systems?
  – the design of models
  – as features
  – as filters
  – …
• Building a treebank is an effective way to do so.
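Incorporating linguistic knowledge "as features" can be sketched as a feature extractor that a statistical tagger might consume. The function name `extract_features` and the particular features chosen (suffix, capitalization, digits, previous word) are assumptions for illustration; they encode a few properties that happen to be useful for English.

```python
def extract_features(tokens, index):
    """Describe the token at `index` with a few hand-picked features."""
    word = tokens[index]
    return {
        "word": word.lower(),
        # English suffixes such as "-ed" hint at the part of speech.
        "suffix2": word[-2:].lower(),
        # Capitalization hints at proper nouns in English.
        "is_capitalized": word[:1].isupper(),
        "is_digit": word.isdigit(),
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
    }

tokens = ["He", "visited", "New", "York", "in", "2003"]
features = extract_features(tokens, 1)
```

Features like capitalization illustrate the trade-off on the slide: they help for English but would carry no information for a language whose script has no case distinction.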