  1. DATA130006 Text Management and Analysis Sequence Labeling 魏忠钰 (Zhongyu Wei) School of Data Science, Fudan University November 15th, 2017

  2. Natural Language Processing Startup ▪ 深度好奇 (DeeplyCurious, an NLP startup)

  3. Joint Distributions ▪ A joint distribution over a set of random variables X_1, ..., X_n specifies a real number for each assignment (or outcome): P(x_1, ..., x_n) ▪ Must obey: P(x_1, ..., x_n) ≥ 0 and Σ_{x_1, ..., x_n} P(x_1, ..., x_n) = 1 ▪ Size of distribution if n variables with domain sizes d? d^n ▪ Impractical to write out! Example joint over temperature T and weather W:
T    W    P
hot  sun  0.4
hot  rain 0.1
cold sun  0.2
cold rain 0.3

  4. Marginal Distributions ▪ Marginal distributions are sub-tables which eliminate variables ▪ Marginalization (summing out): combine collapsed rows by adding, e.g. P(t) = Σ_w P(t, w)
Joint P(T, W):
T    W    P
hot  sun  0.4
hot  rain 0.1
cold sun  0.2
cold rain 0.3
Marginals:
T    P
hot  0.5
cold 0.5
W    P
sun  0.6
rain 0.4
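The summing-out operation above is mechanical enough to sketch in a few lines. A minimal illustration in Python; the dict encoding of the table and the function name are mine, not from the slides:

```python
from collections import defaultdict

# Joint distribution P(T, W) from the table above.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginal(joint, axis):
    """Sum out every variable except the one at position `axis`."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[assignment[axis]] += p
    return dict(out)

p_t = marginal(joint, 0)  # {"hot": 0.5, "cold": 0.5}
p_w = marginal(joint, 1)  # {"sun": 0.6, "rain": 0.4}
```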

  5. Conditional Probabilities ▪ A simple relation between joint and conditional probabilities ▪ In fact, this is taken as the definition of a conditional probability: P(a | b) = P(a, b) / P(b)
T    W    P
hot  sun  0.4
hot  rain 0.1
cold sun  0.2
cold rain 0.3
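Applying the definition to the table above, conditioning on T is a select-and-renormalize operation. A minimal sketch (the dict encoding and helper name are illustrative):

```python
# Joint distribution P(T, W) from the table above.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def conditional_w_given_t(joint, t):
    """P(w | t) = P(t, w) / P(t), with P(t) = sum_w P(t, w)."""
    p_t = sum(p for (ti, _), p in joint.items() if ti == t)
    return {w: p / p_t for (ti, w), p in joint.items() if ti == t}

print(conditional_w_given_t(joint, "hot"))  # {'sun': 0.8, 'rain': 0.2}
```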

  6. Conditional Independence ▪ Unconditional (absolute) independence is very rare ▪ Conditional independence is our most basic and robust form of knowledge about uncertain environments. ▪ X is conditionally independent of Y given Z if and only if: P(x, y | z) = P(x | z) P(y | z) for all x, y, z, or, equivalently, if and only if: P(x | y, z) = P(x | z) for all x, y, z

  7. Outline ▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling

  8. Markov Model ▪ Value of X at a given time is called the state: X_1 → X_2 → X_3 → X_4 ▪ Parameters: called transition probabilities or dynamics, P(X_t | X_{t-1}), specify how the state evolves over time (also, initial state probabilities P(X_1)) ▪ Stationarity assumption: transition probabilities are the same at all times

  9. Joint Distribution of a Markov Model X_1 → X_2 → X_3 → X_4 ▪ Joint distribution: P(X_1, X_2, X_3, X_4) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) P(X_4 | X_3) ▪ More generally: P(X_1, ..., X_T) = P(X_1) ∏_{t=2}^{T} P(X_t | X_{t-1})
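The factored joint above can be evaluated directly as a product of transition probabilities. A minimal Python sketch, using the sun/rain weather chain of the next slide as the example model:

```python
# Transition model P(X_t | X_{t-1}) for the sun/rain weather chain,
# with an initial distribution putting all mass on sun.
initial = {"sun": 1.0, "rain": 0.0}
transition = {
    ("sun", "sun"): 0.9, ("sun", "rain"): 0.1,
    ("rain", "sun"): 0.3, ("rain", "rain"): 0.7,
}

def sequence_probability(states):
    """P(x_1, ..., x_T) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[(prev, cur)]
    return p

print(round(sequence_probability(["sun", "sun", "rain", "rain"]), 6))  # 0.063
```

The sequence sun, sun, rain, rain scores 1.0 * 0.9 * 0.1 * 0.7 = 0.063.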

  10. Example Markov Chain: Weather ▪ States: X = {rain, sun} ▪ Initial distribution: 1.0 sun ▪ CPT P(X_t | X_{t-1}), also representable in two other ways: as a state diagram or as a trellis
X_{t-1} X_t  P(X_t | X_{t-1})
sun     sun  0.9
sun     rain 0.1
rain    sun  0.3
rain    rain 0.7

  11. Mini-Forward Algorithm ▪ Question: What's P(X) on some day t? X_1 → X_2 → X_3 → X_4 ▪ Forward simulation: P(x_t) = Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}), starting from P(x_1)

  12. Example Run of Forward Algorithm
X_{t-1} X_t  P(X_t | X_{t-1})
sun     sun  0.9
sun     rain 0.1
rain    sun  0.3
rain    rain 0.7
▪ From initial observation of sun: P(X_1), P(X_2), P(X_3), P(X_4), ..., P(X_∞) ▪ From initial observation of rain: P(X_1), P(X_2), P(X_3), P(X_4), ..., P(X_∞) ▪ From yet another initial distribution P(X_1): ..., P(X_∞)
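The forward simulation above can be sketched in a few lines. Starting from an initial observation of sun, repeated application of the update drives the distribution toward (0.75, 0.25); the same happens from any other start, which is the point of the next slides:

```python
# Transition model P(X_t | X_{t-1}) for the sun/rain weather chain.
transition = {
    ("sun", "sun"): 0.9, ("sun", "rain"): 0.1,
    ("rain", "sun"): 0.3, ("rain", "rain"): 0.7,
}
states = ["sun", "rain"]

def forward_step(p):
    """One mini-forward update: P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})."""
    return {x: sum(transition[(prev, x)] * p[prev] for prev in states) for x in states}

p = {"sun": 1.0, "rain": 0.0}  # initial observation of sun
for _ in range(50):
    p = forward_step(p)
print(round(p["sun"], 4), round(p["rain"], 4))  # 0.75 0.25
```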

  13. Stationary Distributions ▪ For most chains: the influence of the initial distribution gets less and less over time, and the distribution we end up in is independent of the initial distribution ▪ Stationary distribution: the distribution we end up with is called the stationary distribution P_∞ of the chain ▪ It satisfies P_∞(X) = P_{∞+1}(X) = Σ_x P(X | x) P_∞(x)

  14. Example: Stationary Distributions ▪ Question: What's P(X) at time t = infinity? X_1 → X_2 → X_3 → X_4
X_{t-1} X_t  P(X_t | X_{t-1})
sun     sun  0.9
sun     rain 0.1
rain    sun  0.3
rain    rain 0.7
▪ Solve the stationary equations: P_∞(sun) = 0.9 P_∞(sun) + 0.3 P_∞(rain) and P_∞(rain) = 0.1 P_∞(sun) + 0.7 P_∞(rain) ▪ Also: P_∞(sun) + P_∞(rain) = 1, which gives P_∞(sun) = 3/4 and P_∞(rain) = 1/4
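For a two-state chain the stationary equations collapse to a closed form: only the two "switching" probabilities matter. A small sketch of that derivation (the variable names are mine):

```python
# Two-state chain: p_sr = P(rain | sun), p_rs = P(sun | rain).
p_sr, p_rs = 0.1, 0.3

# Stationary condition pi(sun) = (1 - p_sr) pi(sun) + p_rs pi(rain),
# i.e. p_sr * pi(sun) = p_rs * pi(rain); together with
# pi(sun) + pi(rain) = 1 this gives:
pi_sun = p_rs / (p_sr + p_rs)
pi_rain = p_sr / (p_sr + p_rs)
print(round(pi_sun, 4), round(pi_rain, 4))  # 0.75 0.25
```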

  15. Stationary Distribution for Web-link Analysis ▪ PageRank over a web graph ▪ Each web page is a state ▪ Initial distribution: uniform over pages ▪ Transitions: ▪ With prob. c, uniform jump to a random page (dotted lines, not all shown) ▪ With prob. 1-c, follow a random outlink (solid lines) ▪ Stationary distribution ▪ Will spend more time on highly reachable pages ▪ Somewhat robust to link spam ▪ Google 1.0 returned the set of pages containing all your keywords, ordered by decreasing rank; now all search engines use link analysis along with many other factors (rank is actually getting less important over time)
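The random-surfer stationary distribution described above can be approximated by iterating the transition model, just as in the weather example. A minimal sketch; the 4-page graph is made up for illustration:

```python
# PageRank as a stationary distribution: with prob. c jump to a uniformly
# random page, with prob. 1-c follow a random outlink of the current page.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
c = 0.15  # teleport probability

rank = {p: 1.0 / len(pages) for p in pages}  # start uniform
for _ in range(100):
    new = {p: c / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += (1 - c) * rank[p] / len(outs)
    rank = new

# The highly reachable page (here C, linked by everyone) gets the most mass.
best = max(rank, key=rank.get)
```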

  16. Text as a Graph ▪ Nodes stand for sentences ▪ Edges stand for similarity

  17. Centrality-based Summarization ▪ Assumption: the centrality of a node is an indication of its importance ▪ Representation: connectivity matrix based on inter-sentence cosine similarity ▪ Extraction mechanism: ▪ Compute the PageRank score for every sentence u ▪ Extract the k sentences with the highest PageRank scores
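A LexRank-style sketch of this pipeline: build a cosine-similarity matrix over bag-of-words sentence vectors, run a PageRank-style iteration on it, and keep the top-k sentences. The toy sentences, the damping value, and all names are illustrative, not from the slides:

```python
import math
from collections import Counter

sentences = [
    "cats like milk",
    "cats like mice",
    "mice like cheese",
]

def cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

n = len(sentences)
sim = [[cosine(sentences[i], sentences[j]) for j in range(n)] for i in range(n)]
row_sum = [sum(row) for row in sim]

# Power iteration on the row-normalized similarity matrix, plus a uniform jump.
c = 0.15
score = [1.0 / n] * n
for _ in range(100):
    score = [
        c / n + (1 - c) * sum(sim[i][j] * score[i] / row_sum[i] for i in range(n))
        for j in range(n)
    ]

k = 2
top_k = sorted(range(n), key=lambda j: -score[j])[:k]
```

Sentence 1 bridges the other two, so it ends up most central.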

  18. Outline ▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling

  19. Hidden Markov Model ▪ Hidden Markov models (HMMs) ▪ Underlying Markov chain over states X ▪ You observe outputs (effects) at each time step: X_1 → X_2 → X_3 → X_4 → X_5, with an emission E_t from each state X_t

  20. Example: Weather HMM ▪ Hidden states Rain_t, observations Umbrella_t ▪ An HMM is defined by: ▪ Initial distribution: P(X_1) ▪ Transitions: P(X_t | X_{t-1}) ▪ Emissions: P(E_t | X_t)
Transition model:
R_t R_{t+1} P(R_{t+1} | R_t)
+r  +r      0.7
+r  -r      0.3
-r  +r      0.3
-r  -r      0.7
Emission model:
R_t U_t P(U_t | R_t)
+r  +u  0.9
+r  -u  0.1
-r  +u  0.2
-r  -u  0.8

  21. Conditional Independence ▪ HMMs have two important independence properties: ▪ Markov hidden process: future depends on past via the present ▪ Current observation independent of all else given current state ▪ Does this mean that evidence variables are guaranteed to be independent? ▪ [No, they tend to be correlated by the hidden state]

  22. Chain Rule and HMMs ▪ From the chain rule, every joint distribution over X_1, E_1, ..., X_T, E_T can be written as: P(X_1, E_1, ..., X_T, E_T) = P(X_1) P(E_1 | X_1) ∏_{t=2}^{T} P(X_t | X_1, E_1, ..., X_{t-1}, E_{t-1}) P(E_t | X_1, E_1, ..., X_t, E_{t-1}) ▪ Assuming that for all t: ▪ State is independent of all past states and all past evidence given the previous state, i.e.: P(X_t | X_1, E_1, ..., X_{t-1}, E_{t-1}) = P(X_t | X_{t-1}) ▪ Evidence is independent of all past states and all past evidence given the current state, i.e.: P(E_t | X_1, E_1, ..., X_t, E_{t-1}) = P(E_t | X_t) ▪ So, we have: P(X_1, E_1, ..., X_T, E_T) = P(X_1) P(E_1 | X_1) ∏_{t=2}^{T} P(X_t | X_{t-1}) P(E_t | X_t)

  23. Tasks for HMM ▪ Filtering ▪ Computing the belief state (the posterior distribution over the most recent state) given all evidence to date: P(X_t | e_{1:t}) ▪ Prediction ▪ Computing the posterior distribution over a future state, given all evidence to date: P(X_{t+k} | e_{1:t}) ▪ Smoothing ▪ Computing the posterior distribution over a past state, given all evidence up to the present: P(X_k | e_{1:t}) for 1 ≤ k < t ▪ Most Likely Explanation ▪ Given a sequence of observations, find the sequence of states that is most likely to have generated those observations.

  24. Real HMM Examples ▪ Speech recognition HMMs: ▪ Observations are acoustic signals (continuous valued) ▪ States are specific positions in specific words (so, tens of thousands) ▪ Machine translation HMMs: ▪ Observations are words (tens of thousands) ▪ States are translation options ▪ Robot tracking: ▪ Observations are range readings (continuous) ▪ States are positions on a map (continuous)

  25. Filtering / Monitoring ▪ Filtering, or monitoring, is the task of tracking the distribution B_t(X) = P(X_t | e_1, ..., e_t) (the belief state) over time ▪ We start with B_1(X) in an initial setting, usually uniform ▪ As time passes, or we get observations, we update B(X) ▪ The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program

  26. Inference: Base Cases ▪ Two base cases: incorporating the first observation, P(X_1 | e_1) ∝ P(e_1 | X_1) P(X_1), and one step of time passing, P(X_2) = Σ_{x_1} P(x_1) P(X_2 | x_1)

  27. Passage of Time ▪ Assume we have current belief P(X | evidence to date): B(X_t) = P(X_t | e_{1:t}) ▪ Then, after one time step passes: P(X_{t+1} | e_{1:t}) = Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t}) ▪ Or compactly: B'(X_{t+1}) = Σ_{x_t} P(X_{t+1} | x_t) B(x_t)

  28. Observation ▪ Assume we have current belief P(X | previous evidence): B'(X_{t+1}) = P(X_{t+1} | e_{1:t}) ▪ Then, after evidence comes in: P(X_{t+1} | e_{1:t+1}) ∝ P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t}) ▪ Or, compactly: B(X_{t+1}) ∝ P(e_{t+1} | X_{t+1}) B'(X_{t+1}), followed by renormalization

  29. The Forward Algorithm ▪ We are given evidence at each time and want to know P(X_t | e_{1:t}) ▪ We can derive the following update: P(x_t, e_{1:t}) = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}, e_{1:t-1}) ▪ We can normalize as we go if we want to have P(x | e) at each time step, or just once at the end

  30. Online Belief Updates ▪ Every time step, we start with current P(X | evidence) ▪ We update for time: P(x_t | e_{1:t-1}) = Σ_{x_{t-1}} P(x_{t-1} | e_{1:t-1}) P(x_t | x_{t-1}) ▪ We update for evidence: P(x_t | e_{1:t}) ∝ P(x_t | e_{1:t-1}) P(e_t | x_t) ▪ The forward algorithm does both at once

  31. In-class Quiz ▪ Starting from B(+r) = 0.5, B(-r) = 0.5 at Rain_0, compute the beliefs B(Rain_1) and B(Rain_2) after observing Umbrella_1 = +u and Umbrella_2 = +u
Transition model:
R_t R_{t+1} P(R_{t+1} | R_t)
+r  +r      0.7
+r  -r      0.3
-r  +r      0.3
-r  -r      0.7
Emission model:
R_t U_t P(U_t | R_t)
+r  +u  0.9
+r  -u  0.1
-r  +u  0.2
-r  -u  0.8

  32. Quiz: Weather HMM ▪ Answer (observing +u at each step):
B(+r) = 0.5, B(-r) = 0.5 at Rain_0
Time update: B'(+r) = 0.5, B'(-r) = 0.5; observe +u: B(+r) = 0.818, B(-r) = 0.182 at Rain_1
Time update: B'(+r) = 0.627, B'(-r) = 0.373; observe +u: B(+r) = 0.883, B(-r) = 0.117 at Rain_2
Transition model:
R_t R_{t+1} P(R_{t+1} | R_t)
+r  +r      0.7
+r  -r      0.3
-r  +r      0.3
-r  -r      0.7
Emission model:
R_t U_t P(U_t | R_t)
+r  +u  0.9
+r  -u  0.1
-r  +u  0.2
-r  -u  0.8
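The two updates of the online belief computation can be sketched directly from the umbrella HMM's tables; running them twice with a +u observation each time reproduces the quiz numbers (dict encodings and function names are illustrative):

```python
states = ["+r", "-r"]
# Transition P(R_{t+1} | R_t) and emission P(U_t | R_t) for the umbrella HMM.
T = {("+r", "+r"): 0.7, ("+r", "-r"): 0.3, ("-r", "+r"): 0.3, ("-r", "-r"): 0.7}
E = {("+r", "+u"): 0.9, ("+r", "-u"): 0.1, ("-r", "+u"): 0.2, ("-r", "-u"): 0.8}

def time_update(b):
    """B'(x_{t+1}) = sum_{x_t} P(x_{t+1} | x_t) B(x_t)"""
    return {x: sum(T[(prev, x)] * b[prev] for prev in states) for x in states}

def evidence_update(b, e):
    """B(x) proportional to P(e | x) B'(x), then renormalize."""
    unnorm = {x: E[(x, e)] * b[x] for x in states}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

b = {"+r": 0.5, "-r": 0.5}                 # B(Rain_0)
b = evidence_update(time_update(b), "+u")  # B(Rain_1): +r -> 0.818
b = evidence_update(time_update(b), "+u")  # B(Rain_2): +r -> 0.883
```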

  33. Most Likely Explanation

  34. HMMs: MLE Queries ▪ HMMs defined by ▪ States X ▪ Observations E ▪ Initial distribution: P(X_1) ▪ Transitions: P(X_t | X_{t-1}) ▪ Emissions: P(E_t | X_t) ▪ New query: most likely explanation, argmax_{x_{1:t}} P(x_{1:t} | e_{1:t})

  35. HMMs: MLE Queries ▪ Graph of states and transitions over time: a trellis with the states sun and rain in each column ▪ Each arc represents some transition x_{t-1} → x_t ▪ Each arc has weight P(x_t | x_{t-1}) P(e_t | x_t) ▪ Each path is a sequence of states ▪ The product of weights on a path is that sequence's probability along with the evidence ▪ The forward algorithm computes sums of paths, Viterbi computes best paths

  36. HMMs: MLE Queries ▪ Same trellis of sun/rain states ▪ Viterbi Algorithm (max): m_t[x_t] = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m_{t-1}[x_{t-1}] ▪ Forward Algorithm (sum): f_t[x_t] = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) f_{t-1}[x_{t-1}]
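Replacing the forward algorithm's sum with a max and keeping backpointers gives Viterbi decoding. A minimal sketch on the umbrella HMM with a hypothetical observation sequence (encodings and names are mine):

```python
states = ["+r", "-r"]
init = {"+r": 0.5, "-r": 0.5}
T = {("+r", "+r"): 0.7, ("+r", "-r"): 0.3, ("-r", "+r"): 0.3, ("-r", "-r"): 0.7}
E = {("+r", "+u"): 0.9, ("+r", "-u"): 0.1, ("-r", "+u"): 0.2, ("-r", "-u"): 0.8}

def viterbi(observations):
    """Most likely state sequence: forward recursion with max instead of
    sum, plus backpointers to recover the best path."""
    m = {x: init[x] * E[(x, observations[0])] for x in states}
    back = []
    for e in observations[1:]:
        prev_best = {x: max(states, key=lambda p: m[p] * T[(p, x)]) for x in states}
        m = {x: m[prev_best[x]] * T[(prev_best[x], x)] * E[(x, e)] for x in states}
        back.append(prev_best)
    # Follow backpointers from the best final state.
    path = [max(states, key=m.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["+u", "+u", "-u"]))  # ['+r', '+r', '-r']
```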

  37. Outline ▪ Markov Model ▪ Hidden Markov Model ▪ Hidden Markov Model for Sequence Labeling ▪ Maximum Entropy Markov Model for Sequence Labeling
