CS 4501 Machine Learning for NLP: Introduction
Yangfeng Ji
Department of Computer Science, University of Virginia
Overview
1. Course Information
2. Basic Linear Algebra
3. Basic Probability Theory
4. Statistical Estimation
About Online Lectures
◮ All lectures will be recorded and uploaded to Collab
◮ By default, participants are muted upon entry. If you have a question:
  ◮ Chime in
  ◮ Use the “Raise Hand” feature
  ◮ Send a message via Chat
◮ By default, video is off upon entry
◮ Create a Slack workspace for this course (?)
Course Information
Course Webpage
http://yangfengji.net/uva-nlp-course/
Instructors
◮ Instructor
  ◮ Yangfeng Ji
  ◮ Office hour: TBD
◮ TA
  ◮ Stephanie Schoch
  ◮ Office hour: TBD
Clarification
This is not the class for you if you want to
◮ learn programming
◮ learn basic machine learning
◮ learn how to use PyTorch
Goals of This Course
1. Explain the fundamental NLP techniques
   ◮ Text classification
   ◮ Language modeling
   ◮ Word embeddings
   ◮ Sequence labeling
   ◮ Machine translation
2. Advanced topics
   ◮ Discourse processing, text generation, interpretability in NLP
3. Opportunities to work on some NLP problems
   ◮ Final project
Assignments
◮ No exam
◮ Six homeworks
  ◮ 14% × 6 = 84%
◮ One final project
  ◮ 2–3 students per group
  ◮ Proposal: 4%
  ◮ Final presentation: 6%
  ◮ Final project report: 6%
Policy: late penalty
Homework submissions will be accepted up to 72 hours late, with a 20% deduction of the points per 24 hours as a penalty. For example,
◮ Deadline: August 30th, 11:59 PM
◮ Submission timestamp: September 1st, 9:00 AM (≤ 48 hours late)
◮ Original points of the homework: 7
◮ Actual points: 7 × (1 − 40%) = 4.2    (1)
It is usually better to just turn in what you have on time.
Policy: collaboration
◮ Homeworks
  ◮ Collaboration is not encouraged
  ◮ Students are allowed to discuss with their classmates
◮ Final project
  ◮ It should be a team effort
Policy: grades
Textbooks
◮ Textbook
  ◮ Eisenstein, Natural Language Processing, 2018
◮ Additional textbooks
  ◮ Jurafsky and Martin, Speech and Language Processing, 3rd Edition, 2019
  ◮ Smith, Linguistic Structure Prediction, 2011
  ◮ Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms, 2014
  ◮ Goodfellow, Bengio and Courville, Deep Learning, 2016
All of them are free online.
Piazza
https://piazza.com/virginia/fall2020/cs4501003
◮ Course announcements
◮ Online Q&A
Questions?
Basic Linear Algebra
Linear Equations
Consider the following system of equations
\[
x_1 - x_2 = 1, \qquad x_1 + 2x_2 = 2
\qquad (2)
\]
Each equation represents a line in the following 2-D space.
[Figure: the two lines plotted in the (x_1, x_2) plane]
Linear Equations
Consider the following system of equations
\[
x_1 - x_2 = 1, \qquad x_1 + 2x_2 = 2
\qquad (3)
\]
In matrix notation, it can be written in a more compact form
\[
\mathbf{A}\boldsymbol{x} = \boldsymbol{b}
\qquad (4)
\]
with
\[
\mathbf{A} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix}, \quad
\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad
\boldsymbol{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}
\qquad (5)
\]
Basic Notations
\[
\mathbf{A} = \begin{bmatrix} 1 & -1 \\ 1 & 2 \end{bmatrix}, \quad
\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad
\boldsymbol{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}
\]
◮ $\mathbf{A} \in \mathbb{R}^{m \times n}$: a matrix with $m$ rows and $n$ columns
  ◮ The element on the $i$-th row and the $j$-th column is denoted as $a_{i,j}$
◮ $\boldsymbol{x} \in \mathbb{R}^{n}$: a vector with $n$ entries. By convention, an $n$-dimensional vector is often thought of as a matrix with $n$ rows and 1 column, known as a column vector.
  ◮ The $i$-th element is denoted as $x_i$
Problem: Solve a matrix-vector multiplication by hand and with PyTorch.
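A minimal PyTorch sketch for the second half of the problem, using A and b from Eq. (5) (variable names are my own):

```python
import torch

# Matrix A and vector b from Eq. (5)
A = torch.tensor([[1., -1.],
                  [1., 2.]])
b = torch.tensor([1., 2.])

# Matrix-vector multiplication: entry i of the result is the
# dot product of the i-th row of A with b
result = torch.mv(A, b)  # equivalently: A @ b
print(result)            # tensor([-1., 5.])
```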
ℓ2 Norm
The ℓ2 norm of a vector $\boldsymbol{x} \in \mathbb{R}^{n}$ is defined as
\[
\|\boldsymbol{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}
\qquad (6)
\]
[Figure: a vector x in the (x_1, x_2) plane; its length is ‖x‖_2]
ℓ1 Norm
The ℓ1 norm of a vector $\boldsymbol{x} \in \mathbb{R}^{n}$ is defined as
\[
\|\boldsymbol{x}\|_1 = \sum_{i=1}^{n} |x_i|
\qquad (7)
\]
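Both norms are a single call in PyTorch. A quick sanity check (a minimal sketch; torch.linalg.norm requires a reasonably recent PyTorch release):

```python
import torch

x = torch.tensor([3., -4.])

# l2 norm: sqrt(3^2 + (-4)^2) = 5
print(torch.linalg.norm(x, ord=2).item())  # 5.0

# l1 norm: |3| + |-4| = 7
print(torch.linalg.norm(x, ord=1).item())  # 7.0
```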
Dot Product
The dot product of $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^{n}$ is defined as
\[
\langle \boldsymbol{x}, \boldsymbol{y} \rangle = \boldsymbol{x}^{\top}\boldsymbol{y} = \sum_{i=1}^{n} x_i y_i
\qquad (8)
\]
where $\boldsymbol{x}^{\top}$ is the transpose of $\boldsymbol{x}$.
◮ $\|\boldsymbol{x}\|_2^2 = \langle \boldsymbol{x}, \boldsymbol{x} \rangle$
◮ If $\boldsymbol{x} = (0, 0, \ldots, \underbrace{1}_{x_i}, \ldots, 0)$, then $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = y_i$
◮ If $\boldsymbol{x}$ is a unit vector ($\|\boldsymbol{x}\|_2 = 1$), then $\langle \boldsymbol{x}, \boldsymbol{y} \rangle$ is the projection of $\boldsymbol{y}$ onto the direction of $\boldsymbol{x}$
[Figure: projection of y onto the direction of x]
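A short sketch of these properties in PyTorch (the vectors are my own examples):

```python
import torch

x = torch.tensor([1., 0.])  # a unit vector along the first axis
y = torch.tensor([2., 3.])

# <x, y> picks out the first coordinate of y, i.e., the projection
# of y onto the direction of x
print(torch.dot(x, y).item())       # 2.0

# <y, y> equals the squared l2 norm of y: 2^2 + 3^2 = 13
print(torch.dot(y, y).item())       # 13.0
print((torch.norm(y) ** 2).item())  # 13.0 (up to floating-point error)
```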
Frobenius Norm
The Frobenius norm of a matrix $\mathbf{A} = [a_{i,j}] \in \mathbb{R}^{m \times n}$, denoted by $\|\cdot\|_F$, is defined as
\[
\|\mathbf{A}\|_F = \Big( \sum_{i} \sum_{j} a_{i,j}^2 \Big)^{1/2}
\qquad (9)
\]
◮ The Frobenius norm can be interpreted as the ℓ2 norm of a vector when treating $\mathbf{A}$ as a vector of size $mn$.
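The vector interpretation is easy to verify in PyTorch (a minimal sketch reusing the matrix from Eq. (5)):

```python
import torch

A = torch.tensor([[1., -1.],
                  [1., 2.]])

# Frobenius norm of A
print(torch.linalg.norm(A, ord='fro').item())          # sqrt(7) ≈ 2.6458

# The same value: l2 norm of A flattened into a vector of size m*n
print(torch.linalg.norm(A.reshape(-1), ord=2).item())  # sqrt(7) ≈ 2.6458
```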
Two Special Matrices
◮ The identity matrix, denoted as $\mathbf{I} \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else:
\[
\mathbf{I} = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}
\qquad (10)
\]
◮ A diagonal matrix, denoted as $\mathbf{D} = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, is a matrix where all non-diagonal elements are 0:
\[
\mathbf{D} = \begin{bmatrix} d_1 & & \\ & \ddots & \\ & & d_n \end{bmatrix}
\qquad (11)
\]
Inverse
The inverse of a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is denoted as $\mathbf{A}^{-1}$, which is the unique matrix such that
\[
\mathbf{A}^{-1}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}^{-1}
\qquad (12)
\]
◮ Non-square matrices do not have inverses (by definition)
◮ Not all square matrices are invertible
◮ The solution of the linear equations in Eq. (3) is $\boldsymbol{x} = \mathbf{A}^{-1}\boldsymbol{b}$
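A minimal sketch of solving Eq. (3) in PyTorch (torch.linalg requires a reasonably recent release; in practice, solving the system directly is preferable to forming the inverse):

```python
import torch

# A and b from Eq. (5)
A = torch.tensor([[1., -1.],
                  [1., 2.]])
b = torch.tensor([1., 2.])

# x = A^{-1} b via the explicit inverse
x = torch.linalg.inv(A) @ b

# The numerically preferable route: solve A x = b directly
x_solve = torch.linalg.solve(A, b)

print(x, x_solve)  # both tensor([1.3333, 0.3333]), i.e., x = (4/3, 1/3)
```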
Orthogonal Matrices
◮ Two vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^{n}$ are orthogonal if $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = 0$
[Figure: two perpendicular vectors x and y]
◮ A square matrix $\mathbf{U} \in \mathbb{R}^{n \times n}$ is orthogonal if all its columns are orthogonal to each other and normalized (orthonormal):
\[
\langle \boldsymbol{u}_i, \boldsymbol{u}_j \rangle = 0, \quad \|\boldsymbol{u}_i\| = 1, \quad \|\boldsymbol{u}_j\| = 1
\qquad (13)
\]
for $i, j \in [n]$ and $i \neq j$
◮ Furthermore, $\mathbf{U}^{\top}\mathbf{U} = \mathbf{I} = \mathbf{U}\mathbf{U}^{\top}$, which further implies $\mathbf{U}^{-1} = \mathbf{U}^{\top}$
Problem: Create special matrices using PyTorch.
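A minimal sketch for the problem above (the rotation matrix is my own choice of an orthogonal matrix):

```python
import math
import torch

# Identity matrix I (Eq. 10)
I = torch.eye(3)

# Diagonal matrix D = diag(1, 2, 3) (Eq. 11)
D = torch.diag(torch.tensor([1., 2., 3.]))

# A 2x2 rotation matrix is orthogonal: its columns are orthonormal
theta = math.pi / 4
U = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

# Verify U^T U = I up to floating-point error
print(torch.allclose(U.T @ U, torch.eye(2), atol=1e-6))  # True
```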
Symmetric Matrices
A symmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is defined as
\[
\mathbf{A}^{\top} = \mathbf{A}
\qquad (14)
\]
or, in other words,
\[
a_{i,j} = a_{j,i} \quad \forall\, i, j \in [n]
\qquad (15)
\]
Comments
◮ The identity matrix $\mathbf{I}$ is symmetric
◮ A diagonal matrix is symmetric
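A small illustrative sketch (the symmetrization trick is my own addition, not from the slides): any square matrix can be made symmetric by averaging it with its transpose.

```python
import torch

A = torch.randn(3, 3)  # a random, generally non-symmetric matrix

# Symmetrize: S = (A + A^T) / 2, so s_ij == s_ji for all i, j
S = (A + A.T) / 2

print(torch.equal(S, S.T))  # True
```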
Quiz
The identity matrix $\mathbf{I}$ is
◮ a diagonal matrix? ✓
◮ a symmetric matrix? ✓
◮ an orthogonal matrix? ✓
Further reference: [Kolter, 2015]
Basic Probability Theory
What is Probability?
The probability of landing heads is 0.52.
Two interpretations
Frequentist: probability represents the long-run frequency of an event
◮ If we flip the coin many times, we expect it to land heads about 52% of the time
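A minimal simulation of the frequentist view in PyTorch (the coin bias 0.52 is from the slide; the sample size and seed are arbitrary):

```python
import torch

torch.manual_seed(0)

# Flip a coin with P(heads) = 0.52 many times; heads = 1, tails = 0
flips = torch.bernoulli(torch.full((100_000,), 0.52))

# The empirical frequency of heads approaches 0.52 as the number
# of flips grows
print(flips.mean().item())  # approximately 0.52
```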