Sequential Supervised Learning

Many Application Problems Require Sequential Learning

– Part-of-speech Tagging
– Information Extraction from the Web
– Text-to-Speech Mapping

Part-of-Speech Tagging

Given an English sentence, can we assign a part of speech to each word?

"Do you want fries with that?"
<verb pron verb noun prep pron>

Information Extraction from the Web

<dl><dt><b>Srinivasan Seshan</b> (Carnegie Mellon University)
<dt><a href=…><i>Making Virtual Worlds Real</i></a>
<dt>Tuesday, June 4, 2002<dd>2:00 PM, 322 Sieg<dd>Research Seminar

Target label sequence (one label per token; * marks tokens outside any field):
* * * name name * * affiliation affiliation affiliation * * * * title title title title * * * date date date date * time time * location location * event-type event-type

Text-to-Speech Mapping

"photograph" => /f-Ot@graf-/

Sequential Supervised Learning (SSL)

Given: A set of training examples of the form (X_i, Y_i), where
  X_i = <x_{i,1}, …, x_{i,T_i}> and
  Y_i = <y_{i,1}, …, y_{i,T_i}>
are sequences of length T_i.

Find: A function f for predicting new sequences: Y = f(X).

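As a concrete picture of this data format, here is a minimal Python sketch of one training pair (X_i, Y_i) for the part-of-speech example above; the variable names and the list-of-tuples layout are illustrative choices, not something prescribed by the slides.

```python
# One training example (X_i, Y_i): an input sequence and a label sequence of
# the same length T_i.
training_examples = [
    (["Do", "you", "want", "fries", "with", "that"],     # X_i: sequence of words
     ["verb", "pron", "verb", "noun", "prep", "pron"]),  # Y_i: sequence of tags
]

# The learner must return a function f with Y = f(X): it maps a new word
# sequence to a tag sequence of the same length.
```
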
Examples of Sequential Supervised Learning

Domain                    Input X_i             Output Y_i
Part-of-speech Tagging    sequence of words     sequence of parts of speech
Information Extraction    sequence of tokens    sequence of field labels {name, …}
Text-to-speech Mapping    sequence of letters   sequence of phonemes

Two Kinds of Relationships

[Diagram: labels y_1, y_2, y_3 with corresponding inputs x_1, x_2, x_3]

– "Vertical" relationship between the x_t's and y_t's
  – Example: "Friday" is usually a "date"
– "Horizontal" relationships among the y_t's
  – Example: "name" is usually followed by "affiliation"
– SSL can (and should) exploit both kinds of information

Existing Methods

– Hacks
  – Sliding windows
  – Recurrent sliding windows
– Hidden Markov models
  – joint distribution: P(X, Y)
– Conditional Random Fields
  – conditional distribution: P(Y | X)
– Discriminant Methods: HM-SVMs, MMMs, voted perceptrons
  – discriminant function: f(Y; X)

Sliding Windows

___ Do you want fries with that ___

___ Do you       → verb
Do you want      → pron
you want fries   → verb
want fries with  → noun
fries with that  → prep
with that ___    → pron

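A minimal sketch of how such windows could be generated, assuming one word of context on each side and `___` as the padding symbol; the function name and signature are illustrative, not part of the slides.

```python
def sliding_windows(words, tags, half_width=1, pad="___"):
    """Turn one labeled sequence into independent (window, label) training
    examples, mirroring the 3-word windows shown above."""
    padded = [pad] * half_width + list(words) + [pad] * half_width
    examples = []
    for t, tag in enumerate(tags):
        window = tuple(padded[t : t + 2 * half_width + 1])
        examples.append((window, tag))
    return examples

sentence = ["Do", "you", "want", "fries", "with", "that"]
tags = ["verb", "pron", "verb", "noun", "prep", "pron"]
for window, label in sliding_windows(sentence, tags):
    print(window, "->", label)   # ('___', 'Do', 'you') -> verb, etc.
```
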
Properties of Sliding Windows

– Converts SSL to ordinary supervised learning
– Only captures the relationship between (part of) X and y_t; does not explicitly model relations among the y_t's
– Assumes each window is independent

Recurrent Sliding Windows

___ Do you want fries with that ___

___ Do you ___         → verb
Do you want verb       → pron
you want fries pron    → verb
want fries with verb   → noun
fries with that noun   → prep
with that ___ prep     → pron

Recurrent Sliding Windows

Key Idea: Include y_t as an input feature when computing y_{t+1}.

During training:
– Use the correct value of y_t
– Or train iteratively (especially recurrent neural networks)

During evaluation:
– Use the predicted value of y_t

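A sketch of this feedback loop, reusing the illustrative window setup from the earlier sliding-window sketch; `classify` stands in for whatever base classifier has been trained on the windows. Passing the gold `tags` corresponds to training on correct values, while passing `classify` corresponds to evaluation with predicted values fed back left to right.

```python
def recurrent_windows(words, tags=None, classify=None, half_width=1,
                      pad="___", pad_label="___"):
    """Build recurrent windows: each window also contains the previous label.
    With `tags` given, the correct label is fed forward (training); otherwise
    `classify(window)` is called and its prediction is fed forward (evaluation)."""
    padded = [pad] * half_width + list(words) + [pad] * half_width
    prev_label, out = pad_label, []
    for t in range(len(words)):
        window = tuple(padded[t : t + 2 * half_width + 1]) + (prev_label,)
        label = tags[t] if tags is not None else classify(window)
        out.append((window, label))
        prev_label = label          # the left-to-right feedback
    return out
```
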
Properties of Recurrent Sliding Windows

Captures relationships among the y_t's, but only in one direction!

Results on text-to-speech:

Method           Direction    Words   Letters
sliding window   none         12.5%   69.6%
recurrent s. w.  left-right   17.0%   67.9%
recurrent s. w.  right-left   24.4%   74.2%

Hidden Markov Models

Generalization of Naïve Bayes to SSL.

[Diagram: Markov chain y_1 → y_2 → y_3 → y_4 → y_5, each y_t emitting x_t]

– P(y_1)
– P(y_t | y_{t-1}), assumed the same for all t
– P(x_t | y_t) = P(x_{t,1} | y_t) · P(x_{t,2} | y_t) ⋯ P(x_{t,n} | y_t), assumed the same for all t

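To make this factorization concrete, here is a sketch of an HMM parameterization in NumPy, assuming tags and observations have been mapped to integer ids; the names `pi`, `A`, `B`, the sizes, and the random initialization are illustrative. The emission here uses a single observation per position; with several features x_{t,1}, …, x_{t,n}, the emission term would be the naive-Bayes product from the slide.

```python
import numpy as np

K, V = 4, 6                              # number of labels, number of observation values
rng = np.random.default_rng(0)
pi = np.full(K, 1.0 / K)                 # pi[k]   = P(y_1 = k)
A = rng.dirichlet(np.ones(K), size=K)    # A[i, j] = P(y_t = j | y_{t-1} = i), same for all t
B = rng.dirichlet(np.ones(V), size=K)    # B[k, v] = P(x_t = v | y_t = k),     same for all t

def joint_log_prob(x, y):
    """log P(X, Y) = log P(y_1) + sum_t log P(y_t | y_{t-1}) + sum_t log P(x_t | y_t)."""
    lp = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
    return lp
```
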
Making Predictions with HMMs

Two possible goals:
– argmax_Y P(Y | X): find the most likely sequence of labels Y given the input sequence X
– argmax_{y_t} P(y_t | X) for all t: find the most likely label y_t at each time t given the entire input sequence X

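The Viterbi algorithm on the following slides addresses the first goal. For the second goal, one standard route (not covered on these slides) is the forward-backward recursions; the sketch below assumes the illustrative `pi`/`A`/`B` parameterization from the previous block and omits rescaling, so it is only meant for short sequences.

```python
import numpy as np

def label_marginals(pi, A, B, obs):
    """P(y_t = k | X) for every position t, via forward-backward.
    pi: (K,) P(y_1); A[i, j] = P(y_t = j | y_{t-1} = i);
    B[k, v] = P(x_t = v | y_t = k); obs: list of integer observation ids."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))             # alpha[t, k] = P(x_1..x_t, y_t = k)
    beta = np.zeros((T, K))              # beta[t, k]  = P(x_{t+1}..x_T | y_t = k)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                 # proportional to P(y_t = k, X)
    return gamma / gamma.sum(axis=1, keepdims=True)
```
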
Finding Most Likely Label Sequence: The Trellis

[Trellis diagram: one column of label nodes (verb, pronoun, noun, adjective) per word of "Do you want fries sir?", plus a start node s and a final node f; the path verb, pronoun, verb, noun, noun is highlighted]

Every label sequence corresponds to a path through the trellis graph. The probability of a label sequence is proportional to

P(y_1) · P(x_1 | y_1) · P(y_2 | y_1) · P(x_2 | y_2) ⋯ P(y_T | y_{T-1}) · P(x_T | y_T)

Converting to a Shortest Path Problem

[Same trellis diagram as on the previous slide]

max_{y_1,…,y_T} P(y_1) · P(x_1 | y_1) · P(y_2 | y_1) · P(x_2 | y_2) ⋯ P(y_T | y_{T-1}) · P(x_T | y_T)
= min_{y_1,…,y_T} –log[P(y_1) · P(x_1 | y_1)] + –log[P(y_2 | y_1) · P(x_2 | y_2)] + ⋯ + –log[P(y_T | y_{T-1}) · P(x_T | y_T)]

This is a shortest path through the trellis with edge cost –log[P(y_t | y_{t-1}) · P(x_t | y_t)].

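A small sketch of this conversion, again assuming the illustrative `pi`/`A`/`B` arrays: it packages the negated log-probabilities into a vector of start costs and one edge-cost matrix per step, so that minimizing total path cost is the same as maximizing the product of probabilities.

```python
import numpy as np

def trellis_costs(pi, A, B, obs):
    """Start costs c0[k] = -log[P(y_1 = k) · P(x_1 | k)] and, for t = 2..T,
    edge costs edge[t][i, j] = -log[P(y_t = j | y_{t-1} = i) · P(x_t | j)]."""
    c0 = -(np.log(pi) + np.log(B[:, obs[0]]))
    edge = [-(np.log(A) + np.log(B[:, obs[t]])[None, :]) for t in range(1, len(obs))]
    return c0, edge
```
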
Finding Most Likely Label Sequence: The Viterbi Algorithm

[Same trellis diagram as before]

Step t of the Viterbi algorithm considers the possible successors of each state y_{t-1} and computes the total path length for each edge.

Finding Most Likely Label Sequence: The Viterbi Algorithm (continued)

[Same trellis diagram as before]

Each node y_t = k stores the cost μ(k) of the shortest path that reaches it from s, and the predecessor class y_{t-1} = k' that achieves this cost:

k'   = argmin_{y_{t-1}} –log[P(y_t | y_{t-1}) · P(x_t | y_t)] + μ(y_{t-1})
μ(k) = min_{y_{t-1}}    –log[P(y_t | y_{t-1}) · P(x_t | y_t)] + μ(y_{t-1})

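Putting the pieces together, here is a sketch of the Viterbi recurrence exactly as stated above (shortest-path cost μ plus a backpointer k' per node), again assuming the integer-id `pi`/`A`/`B` parameterization; the NumPy details are illustrative.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """argmax_Y P(Y | X) for an HMM, via the shortest-path recurrence."""
    T, K = len(obs), len(pi)
    mu = np.zeros((T, K))                # mu[t, k]: cost of the cheapest path reaching y_t = k
    back = np.zeros((T, K), dtype=int)   # back[t, k]: predecessor class k' achieving that cost
    mu[0] = -(np.log(pi) + np.log(B[:, obs[0]]))
    for t in range(1, T):
        # cand[i, j] = mu[t-1, i] - log[P(y_t = j | y_{t-1} = i) · P(x_t | y_t = j)]
        cand = mu[t - 1][:, None] - (np.log(A) + np.log(B[:, obs[t]])[None, :])
        back[t] = cand.argmin(axis=0)
        mu[t] = cand.min(axis=0)
    # trace the backpointers from the best final state to recover the sequence
    y = [int(mu[T - 1].argmin())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```

For example, with the earlier illustrative parameters, `viterbi(pi, A, B, [0, 3, 1, 2, 4])` returns a list of label ids of the same length as the observation sequence.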