DATA130006 Text Management and Analysis: Sequence Labeling 魏忠钰 (Zhongyu Wei) School of Data Science, Fudan University (复旦大学大数据学院) November 15th, 2017
Natural Language Processing Startup ▪ 深度好奇
Joint Distributions ▪ A joint distribution over a set of random variables X_1, ..., X_n specifies a real number for each assignment (or outcome) ▪ Example P(T, W): hot/sun 0.4, hot/rain 0.1, cold/sun 0.2, cold/rain 0.3 ▪ Must obey: P(x_1, ..., x_n) ≥ 0 and Σ_{x_1,...,x_n} P(x_1, ..., x_n) = 1 ▪ Size of distribution if n variables with domain sizes d? d^n entries ▪ Impractical to write out!
Marginal Distributions ▪ Marginal distributions are sub-tables which eliminate variables ▪ Marginalization (summing out): combine collapsed rows by adding, e.g. P(t) = Σ_w P(t, w) ▪ From the joint P(T, W) (hot/sun 0.4, hot/rain 0.1, cold/sun 0.2, cold/rain 0.3): ▪ P(T): hot 0.5, cold 0.5 ▪ P(W): sun 0.6, rain 0.4
Conditional Probabilities ▪ A simple relation between joint and conditional probabilities: P(a | b) = P(a, b) / P(b) ▪ In fact, this is taken as the definition of a conditional probability ▪ Example, from the joint P(T, W) above: P(W = sun | T = hot) = P(hot, sun) / P(hot) = 0.4 / 0.5 = 0.8
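The running P(T, W) example above can be sketched in a few lines of Python; the dict layout and helper names below are illustrative, not from the slides:

```python
# Marginalization and conditioning over the joint table P(T, W) from the
# slides, stored as a plain dict mapping assignments to probabilities.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginal(joint, axis):
    """Sum out every variable except the one at position `axis`."""
    out = {}
    for assignment, p in joint.items():
        out[assignment[axis]] = out.get(assignment[axis], 0.0) + p
    return out

def conditional(joint, t):
    """P(W | T=t): keep the matching rows and renormalize by P(t)."""
    p_t = marginal(joint, 0)[t]
    return {w: p / p_t for (tv, w), p in joint.items() if tv == t}

p_t = marginal(joint, 0)                   # P(T): hot 0.5, cold 0.5
p_w = marginal(joint, 1)                   # P(W): sun 0.6, rain 0.4
p_w_given_hot = conditional(joint, "hot")  # P(W | hot): sun 0.8, rain 0.2
```

The dict-of-tuples layout makes the d^n blow-up tangible: every extra variable multiplies the number of keys by its domain size.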
Conditional Independence ▪ Unconditional (absolute) independence is very rare ▪ Conditional independence is our most basic and robust form of knowledge about uncertain environments. ▪ X is conditionally independent of Y given Z if and only if: P(x, y | z) = P(x | z) P(y | z) for all x, y, z ▪ or, equivalently, if and only if: P(x | y, z) = P(x | z) for all x, y, z
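The equivalence can be checked numerically. A minimal sketch: build a joint where X and Y are generated independently given Z (the numbers below are made up for illustration), then verify that conditioning on Y changes nothing once Z is known:

```python
# Construct P(Z, X, Y) = P(Z) P(X|Z) P(Y|Z) and check P(x | z, y) = P(x | z).
p_z = {"+z": 0.6, "-z": 0.4}
p_x_given_z = {"+z": {"+x": 0.7, "-x": 0.3}, "-z": {"+x": 0.2, "-x": 0.8}}
p_y_given_z = {"+z": {"+y": 0.9, "-y": 0.1}, "-z": {"+y": 0.5, "-y": 0.5}}

joint = {
    (z, x, y): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
    for z in p_z for x in ("+x", "-x") for y in ("+y", "-y")
}

def cond_x(joint, z, y=None):
    """P(X | Z=z) if y is None, else P(X | Z=z, Y=y)."""
    rows = {xv: sum(p for (zv, xv2, yv), p in joint.items()
                    if zv == z and xv2 == xv and (y is None or yv == y))
            for xv in ("+x", "-x")}
    total = sum(rows.values())
    return {x: p / total for x, p in rows.items()}

# Once Z is known, observing Y does not move the posterior over X:
assert abs(cond_x(joint, "+z")["+x"] - cond_x(joint, "+z", "+y")["+x"]) < 1e-12
```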
Outline ▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
Markov Model ▪ Value of X at a given time is called the state: X_1, X_2, X_3, X_4, ... ▪ Parameters: transition probabilities (or dynamics) P(X_t | X_{t-1}) specify how the state evolves over time (also, initial state probabilities P(X_1)) ▪ Stationarity assumption: transition probabilities are the same at all times
Joint Distribution of a Markov Model X_1 X_2 X_3 X_4 ▪ Joint distribution: P(X_1, X_2, X_3, X_4) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) P(X_4 | X_3) ▪ More generally: P(X_1, ..., X_T) = P(X_1) ∏_{t=2..T} P(X_t | X_{t-1})
Example Markov Chain: Weather ▪ States: X = {rain, sun} ▪ Initial distribution: 1.0 sun ▪ CPT P(X_t | X_{t-1}): sun→sun 0.9, sun→rain 0.1, rain→sun 0.3, rain→rain 0.7 ▪ The same CPT can also be drawn as a state-transition diagram or as an unrolled network — two new ways of representing the same CPT
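The Markov factorization from the previous slide can be applied directly to this chain. A minimal Python sketch (dict names are my own) that scores a state sequence as P(x_1) ∏_t P(x_t | x_{t-1}):

```python
# Score a state sequence under the sun/rain weather chain from the slide.
init = {"sun": 1.0, "rain": 0.0}
trans = {("sun", "sun"): 0.9, ("sun", "rain"): 0.1,
         ("rain", "sun"): 0.3, ("rain", "rain"): 0.7}

def sequence_prob(states):
    """P(x_1) * prod_t P(x_t | x_{t-1})."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

sequence_prob(["sun", "sun", "rain"])  # 1.0 * 0.9 * 0.1 ≈ 0.09
```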
Mini-Forward Algorithm ▪ Question: What's P(X) on some day t? X_1 X_2 X_3 X_4 ▪ Forward simulation: P(x_t) = Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}), starting from P(x_1)
Example Run of Forward Algorithm ▪ CPT P(X_t | X_{t-1}): sun→sun 0.9, sun→rain 0.1, rain→sun 0.3, rain→rain 0.7 ▪ Distributions written as ⟨P(sun), P(rain)⟩ ▪ From initial observation of sun: P(X_1) = ⟨1.0, 0.0⟩, P(X_2) = ⟨0.9, 0.1⟩, P(X_3) = ⟨0.84, 0.16⟩, P(X_4) = ⟨0.804, 0.196⟩, ..., P(X_∞) = ⟨0.75, 0.25⟩ ▪ From initial observation of rain: P(X_1) = ⟨0.0, 1.0⟩, P(X_2) = ⟨0.3, 0.7⟩, P(X_3) = ⟨0.48, 0.52⟩, P(X_4) = ⟨0.588, 0.412⟩, ..., P(X_∞) = ⟨0.75, 0.25⟩ ▪ From yet another initial distribution P(X_1): ..., P(X_∞) = ⟨0.75, 0.25⟩
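The run above is a repeated application of the mini-forward update. A sketch in Python (function names are my own), showing both the first few steps and the long-run behavior:

```python
# Mini-forward update: P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}),
# applied to the sun/rain chain from the slides.
trans = {("sun", "sun"): 0.9, ("sun", "rain"): 0.1,
         ("rain", "sun"): 0.3, ("rain", "rain"): 0.7}

def forward_step(p):
    return {cur: sum(p[prev] * trans[(prev, cur)] for prev in p)
            for cur in ("sun", "rain")}

p = {"sun": 1.0, "rain": 0.0}   # initial observation: sun
for _ in range(3):
    p = forward_step(p)
p4 = p                          # P(X_4) ≈ {'sun': 0.804, 'rain': 0.196}

for _ in range(50):
    p = forward_step(p)         # long run: approaches {'sun': 0.75, 'rain': 0.25}
```

Starting from rain instead yields different early distributions but the same limit — the chain forgets its initial distribution.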
Stationary Distributions ▪ For most chains: the influence of the initial distribution gets less and less over time; the distribution we end up in is independent of the initial distribution ▪ Stationary distribution: the distribution we end up with is called the stationary distribution P_∞ of the chain ▪ It satisfies: P_∞(X) = P_{∞+1}(X) = Σ_x P(X | x) P_∞(x)
Example: Stationary Distributions ▪ Question: What's P(X) at time t = infinity? ▪ CPT P(X_t | X_{t-1}): sun→sun 0.9, sun→rain 0.1, rain→sun 0.3, rain→rain 0.7 ▪ Solve: P_∞(sun) = 0.9 · P_∞(sun) + 0.3 · P_∞(rain) and P_∞(rain) = 0.1 · P_∞(sun) + 0.7 · P_∞(rain) ▪ Also: P_∞(sun) + P_∞(rain) = 1 ▪ Solution: P_∞(sun) = 3/4, P_∞(rain) = 1/4
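For a two-state chain the stationary equations can be solved in closed form rather than by simulation; a minimal sketch (variable names are my own):

```python
# pi(sun) = 0.9*pi(sun) + 0.3*pi(rain), with pi(sun) + pi(rain) = 1,
# rearranges to pi(sun) = P(sun|rain) / (P(sun|rain) + P(rain|sun)).
p_rain_to_sun = 0.3   # P(sun | rain)
p_sun_to_rain = 0.1   # P(rain | sun)

pi_sun = p_rain_to_sun / (p_rain_to_sun + p_sun_to_rain)
pi_rain = 1.0 - pi_sun
# pi_sun ≈ 0.75, pi_rain ≈ 0.25 — matching the forward-simulation limit
```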
Stationary Distributions for Web Link Analysis ▪ PageRank over a web graph ▪ Each web page is a state ▪ Initial distribution: uniform over pages ▪ Transitions: ▪ With prob. c, uniform jump to a random page (dotted lines, not all shown) ▪ With prob. 1-c, follow a random outlink (solid lines) ▪ Stationary distribution ▪ Will spend more time on highly reachable pages ▪ Somewhat robust to link spam ▪ Google 1.0 returned the set of pages containing all your keywords in decreasing rank; now all search engines use link analysis along with many other factors (rank is actually getting less important over time)
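The random-surfer chain above can be sketched as power iteration toward its stationary distribution; the three-page link graph below is made up for illustration:

```python
# PageRank as the stationary distribution of a random surfer:
# with prob. c jump uniformly, with prob. 1-c follow a random outlink.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)
c = 0.15  # teleport probability

def pagerank_step(rank):
    new = {p: c / len(pages) for p in pages}        # uniform jump mass
    for page, outlinks in links.items():
        for target in outlinks:                     # follow-a-link mass
            new[target] += (1 - c) * rank[page] / len(outlinks)
    return new

rank = {p: 1 / len(pages) for p in pages}           # uniform initial distribution
for _ in range(100):
    rank = pagerank_step(rank)
# rank now approximates the stationary distribution and still sums to 1
```

Page C, which is linked from both A and B, ends up with the most surfer time — "highly reachable" pages rank highest.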
Text as a Graph ▪ Nodes stand for sentences ▪ Edges stand for similarity between sentences
Centrality-based Summarization ▪ Assumption: the centrality of a node is an indication of its importance ▪ Representation: connectivity matrix based on inter-sentence cosine similarity ▪ Extraction mechanism ▪ Compute the PageRank score for every sentence u ▪ Extract the k sentences with the highest PageRank scores
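The extraction mechanism can be sketched end to end: bag-of-words cosine similarity between sentences, power iteration over the row-normalized similarity matrix, then pick the top-scoring sentence. The toy sentences are made up, and the sketch omits the damping factor and similarity threshold that LexRank-style systems typically add:

```python
from collections import Counter
from math import sqrt

sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]

def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    return dot / (sqrt(sum(v * v for v in ca.values())) *
                  sqrt(sum(v * v for v in cb.values())))

n = len(sentences)
sim = [[cosine(s, t) for t in sentences] for s in sentences]

# Power iteration over the row-normalized similarity matrix.
scores = [1.0 / n] * n
for _ in range(50):
    scores = [sum(scores[j] * sim[j][i] / sum(sim[j]) for j in range(n))
              for i in range(n)]

top = max(range(n), key=lambda i: scores[i])  # most central sentence
```

Sentence 0 overlaps with both of its neighbors, so it comes out most central; the off-topic stock sentence shares no words with the rest and scores no higher than its initial mass.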
Outline ▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
Hidden Markov Model ▪ Hidden Markov models (HMMs) ▪ Underlying Markov chain over states X ▪ You observe outputs (effects) at each time step X 1 X 2 X 3 X 4 X 5 E 1 E 2 E 3 E 4 E 5
Example: Weather HMM ▪ Transition CPT P(R_{t+1} | R_t): +r→+r 0.7, +r→−r 0.3, −r→+r 0.3, −r→−r 0.7 ▪ Emission CPT P(U_t | R_t): (+r, +u) 0.9, (+r, −u) 0.1, (−r, +u) 0.2, (−r, −u) 0.8 ▪ An HMM is defined by: ▪ Initial distribution: P(X_1) ▪ Transitions: P(X_t | X_{t-1}) ▪ Emissions: P(E_t | X_t)
Conditional Independence ▪ HMMs have two important independence properties: ▪ Markov hidden process: the future depends on the past only via the present ▪ Current observation is independent of all else given the current state X_1 X_2 X_3 X_4 X_5 E_1 E_2 E_3 E_4 E_5 ▪ Does this mean that evidence variables are guaranteed to be independent? ▪ [No, they tend to be correlated through the hidden state]
Chain Rule and HMMs ▪ From the chain rule, every joint distribution over X_1, E_1, ..., X_T, E_T can be written as: P(X_1, E_1, ..., X_T, E_T) = P(X_1) P(E_1 | X_1) ∏_{t=2..T} P(X_t | X_{1:t-1}, E_{1:t-1}) P(E_t | X_{1:t}, E_{1:t-1}) ▪ Assuming that for all t: ▪ State is independent of all past states and all past evidence given the previous state, i.e.: P(X_t | X_{1:t-1}, E_{1:t-1}) = P(X_t | X_{t-1}) ▪ Evidence is independent of all past states and all past evidence given the current state, i.e.: P(E_t | X_{1:t}, E_{1:t-1}) = P(E_t | X_t) ▪ So, we have: P(X_1, E_1, ..., X_T, E_T) = P(X_1) P(E_1 | X_1) ∏_{t=2..T} P(X_t | X_{t-1}) P(E_t | X_t)
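The simplified factorization can be scored directly. A minimal sketch on the rain/umbrella numbers from the weather HMM slide, assuming a uniform initial distribution (the slide does not state one):

```python
# HMM joint: P(x_1) P(e_1|x_1) * prod_t P(x_t|x_{t-1}) P(e_t|x_t)
init = {"+r": 0.5, "-r": 0.5}
trans = {("+r", "+r"): 0.7, ("+r", "-r"): 0.3,
         ("-r", "+r"): 0.3, ("-r", "-r"): 0.7}
emit = {("+r", "+u"): 0.9, ("+r", "-u"): 0.1,
        ("-r", "+u"): 0.2, ("-r", "-u"): 0.8}

def joint_prob(states, obs):
    p = init[states[0]] * emit[(states[0], obs[0])]
    for t in range(1, len(states)):
        p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
    return p

joint_prob(["+r", "+r"], ["+u", "+u"])  # 0.5 * 0.9 * 0.7 * 0.9 ≈ 0.2835
```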
Tasks for HMM ▪ Filtering ▪ Computing the belief state — the posterior distribution over the most recent state — given all evidence to date: P(X_t | e_{1:t}) ▪ Prediction ▪ Computing the posterior distribution over a future state, given all evidence to date: P(X_{t+k} | e_{1:t}), k > 0 ▪ Smoothing ▪ Computing the posterior distribution over a past state, given all evidence up to the present: P(X_k | e_{1:t}), 1 ≤ k < t ▪ Most Likely Explanation ▪ Given a sequence of observations, find the sequence of states that is most likely to have generated those observations.
Real HMM Examples ▪ Speech recognition HMMs: ▪ Observations are acoustic signals (continuous valued) ▪ States are specific positions in specific words (so, tens of thousands) ▪ Machine translation HMMs: ▪ Observations are words (tens of thousands) ▪ States are translation options ▪ Robot tracking: ▪ Observations are range readings (continuous) ▪ States are positions on a map (continuous)
Filtering / Monitoring ▪ Filtering, or monitoring, is the task of tracking the distribution B_t(X) = P(X_t | e_1, ..., e_t) (the belief state) over time ▪ We start with B_1(X) in an initial setting, usually uniform ▪ As time passes, or we get observations, we update B(X) ▪ The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program
Inference: Base Cases ▪ Observation only: P(x_1 | e_1) ∝ P(x_1) P(e_1 | x_1) ▪ Passage of time only: P(x_2) = Σ_{x_1} P(x_2 | x_1) P(x_1)
Passage of Time ▪ Assume we have current belief P(X | evidence to date): B(X_t) = P(X_t | e_{1:t}) ▪ Then, after one time step passes: P(X_{t+1} | e_{1:t}) = Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t}) ▪ Or compactly: B'(X_{t+1}) = Σ_{x_t} P(X_{t+1} | x_t) B(x_t)
Observation ▪ Assume we have current belief P(X | previous evidence): B'(X_{t+1}) = P(X_{t+1} | e_{1:t}) ▪ Then, after evidence comes in: P(X_{t+1} | e_{1:t+1}) ∝ P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t}) ▪ Or, compactly: B(X_{t+1}) ∝ P(e_{t+1} | X_{t+1}) B'(X_{t+1})
The Forward Algorithm ▪ We are given evidence at each time and want to know P(X_t | e_{1:t}) ▪ We can derive the following update: f_t(x_t) = P(x_t, e_{1:t}) = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) f_{t-1}(x_{t-1}) ▪ We can normalize as we go if we want to have P(x | e) at each time step, or just once at the end...
Online Belief Updates ▪ Every time step, we start with current P(X | evidence) ▪ We update for time: P(x_t | e_{1:t-1}) = Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1} | e_{1:t-1}) ▪ We update for evidence: P(x_t | e_{1:t}) ∝ P(e_t | x_t) P(x_t | e_{1:t-1}) ▪ The forward algorithm does both at once
In-Class Quiz ▪ Prior: B(+r) = 0.5, B(−r) = 0.5; an umbrella is observed on days 1 and 2 — compute B(R_1) and B(R_2) ▪ Transition P(R_{t+1} | R_t): +r→+r 0.7, +r→−r 0.3, −r→+r 0.3, −r→−r 0.7 ▪ Emission P(U_t | R_t): (+r, +u) 0.9, (+r, −u) 0.1, (−r, +u) 0.2, (−r, −u) 0.8
Quiz: Weather HMM (solution) ▪ Day 0: B(+r) = 0.5, B(−r) = 0.5 ▪ Day 1: after the time update, B'(+r) = 0.5, B'(−r) = 0.5; after observing +u, B(+r) = 0.818, B(−r) = 0.182 ▪ Day 2: after the time update, B'(+r) = 0.627, B'(−r) = 0.373; after observing +u, B(+r) = 0.883, B(−r) = 0.117
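The quiz can be checked mechanically: alternate the time update and the (normalized) observation update. A sketch in Python, using the CPTs and uniform prior from the quiz:

```python
# Forward filtering on the rain/umbrella HMM: time update, then
# observation update with renormalization, once per observed day.
init = {"+r": 0.5, "-r": 0.5}
trans = {("+r", "+r"): 0.7, ("+r", "-r"): 0.3,
         ("-r", "+r"): 0.3, ("-r", "-r"): 0.7}
emit = {("+r", "+u"): 0.9, ("+r", "-u"): 0.1,
        ("-r", "+u"): 0.2, ("-r", "-u"): 0.8}

def forward(obs):
    b = dict(init)
    for e in obs:
        # passage of time: B'(x') = sum_x P(x'|x) B(x)
        b = {x2: sum(trans[(x1, x2)] * b[x1] for x1 in b)
             for x2 in ("+r", "-r")}
        # observation: B(x') ∝ P(e|x') B'(x'), then normalize
        b = {x: emit[(x, e)] * p for x, p in b.items()}
        z = sum(b.values())
        b = {x: p / z for x, p in b.items()}
    return b

forward(["+u"])        # B(R_1): +r ≈ 0.818, -r ≈ 0.182
forward(["+u", "+u"])  # B(R_2): +r ≈ 0.883, -r ≈ 0.117
```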
Most Likely Explanation
HMMs: MLE Queries ▪ HMMs are defined by: ▪ States X ▪ Observations E ▪ Initial distribution: P(X_1) ▪ Transitions: P(X_t | X_{t-1}) ▪ Emissions: P(E_t | X_t) ▪ New query: most likely explanation: argmax_{x_{1:t}} P(x_{1:t} | e_{1:t}) X_1 X_2 X_3 X_4 E_1 E_2 E_3 E_4
HMMs: MLE Queries ▪ Graph of states and transitions over time: sun sun sun sun / rain rain rain rain ▪ Each arc represents some transition x_{t-1} → x_t ▪ Each arc has weight P(x_t | x_{t-1}) P(e_t | x_t) ▪ Each path is a sequence of states ▪ The product of weights on a path is that sequence's probability along with the evidence ▪ The forward algorithm computes sums over paths; Viterbi computes best paths
HMMs: MLE Queries ▪ Forward Algorithm (Sum): f_t(x_t) = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) f_{t-1}(x_{t-1}) ▪ Viterbi Algorithm (Max): m_t(x_t) = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m_{t-1}(x_{t-1})
Outline ▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling