Learning how to Active Learn: A Deep Reinforcement Learning Approach
Meng Fang, Yuan Li, Trevor Cohn
The University of Melbourne
Presenter: Jialin Song
CS 546 Machine Learning in NLP
April 05, 2018
Overview
1 Introduction
2 Model
3 Algorithms
4 Numerical Experiments
Introduction: Active Learning
1 Annotation:
⋄ select a subset of data to annotate from a large unlabelled dataset (adding labels)
⋄ then train a supervised learning model φ (a classifier) on the labelled subset
⋄ we hope to maximize the accuracy of the classification model
2 Active learning:
⋄ annotating every sentence is costly
⋄ the question is how to select raw data to label so as to maximize the accuracy of the classification model
⋄ active learning thus becomes a sequential decision problem: as each sentence arrives, annotate it or not (our action); a minimal loop is sketched below
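The annotate-or-skip view can be made concrete with a small loop. Below is a minimal sketch, assuming a hypothetical annotate() labelling oracle and a classifier phi with a scikit-learn-style fit/predict_proba interface; the decision here uses simple uncertainty thresholding, whereas the paper learns this decision as a deep RL policy.

    import numpy as np

    def stream_active_learning(stream, annotate, phi, budget, threshold=0.7):
        # phi is assumed pre-trained on a small seed set, so predict_proba
        # is usable from the first incoming sentence.
        labelled_x, labelled_y = [], []
        for x in stream:                          # sentences arrive one at a time
            if budget <= 0:
                break
            confidence = np.max(phi.predict_proba([x])[0])
            if confidence < threshold:            # model is uncertain about x
                y = annotate(x)                   # pay the annotation cost
                labelled_x.append(x)
                labelled_y.append(y)
                phi.fit(labelled_x, labelled_y)   # retrain on the grown labelled set
                budget -= 1
        return phi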
Introduction: MDP
1 Markov Decision Process (MDP):
⋄ a framework for modelling a sequential decision process
⋄ at each decision stage, the agent observes the state variables (s) and takes an action (a) to maximize its payoff
⋄ after taking the action, a reward associated with the state and action, r(s, a), is generated and the current state transitions to the next state
⋄ the agent aims to maximize the expected sum of rewards over all stages; a minimal interaction loop is sketched below
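A minimal sketch of the agent-environment loop, assuming a hypothetical env object with reset()/step() and a policy function; it makes the objective explicit as the discounted sum of rewards collected over the stages.

    def run_episode(env, policy, alpha=0.99, max_steps=100):
        # One rollout through the MDP: observe state, act, collect reward, transition.
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = policy(s)                  # action taken in the current state
            s_next, r, done = env.step(a)  # reward r(s, a) and transition to s'
            total += discount * r          # accumulate the discounted reward
            discount *= alpha
            s = s_next
            if done:
                break
        return total                       # a sample of the objective the agent maximizes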
Introduction: Bellman Equation
1 The dynamics of an MDP can be modelled by the Bellman equations
⋄ Bellman equation 1: value function
    J(s) = \max_a \big[ \bar{r}(s, a) + \alpha \sum_{s'} P_{ss'}(a) \, J(s') \big],
    a^*_s = \arg\max_a \big[ \bar{r}(s, a) + \alpha \sum_{s'} P_{ss'}(a) \, J(s') \big]
⋄ Bellman equation 2 (more common!): Q-function
    Q(s, a) = \bar{r}(s, a) + \alpha \sum_{s'} P_{ss'}(a) \max_u Q(s', u),
    a^*_s = \arg\max_a Q(s, a)
⋄ where \bar{r}(s, a) is the expected reward, P_{ss'}(a) is the probability of transitioning from state s to s' under action a, and \alpha is the discount factor
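When \bar{r}(s, a) and P_{ss'}(a) are known, the first Bellman equation can be solved directly, e.g. by value iteration (as the next slide notes). A minimal tabular sketch, assuming R is an |S| × |A| array of expected rewards and P an |S| × |A| × |S| array of transition probabilities:

    import numpy as np

    def value_iteration(P, R, alpha=0.95, tol=1e-6):
        # Iterate J(s) <- max_a [ R[s, a] + alpha * sum_s' P[s, a, s'] * J(s') ]
        n_states, n_actions = R.shape
        J = np.zeros(n_states)
        while True:
            Q = R + alpha * P.dot(J)      # shape (|S|, |A|)
            J_new = Q.max(axis=1)
            if np.max(np.abs(J_new - J)) < tol:
                break
            J = J_new
        policy = Q.argmax(axis=1)         # a*_s = argmax_a Q(s, a)
        return J, policy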
Q-Learning
1 If P_{ss'}(a) is known, then solve the Bellman equations (via value iteration or policy iteration) to get the optimal policy. There is no need to 'learn'!
2 If P_{ss'}(a) is not known, then computing the Q-function becomes a learning problem
3 Q-learning:
⋄ Q_{t+1}(s_t, a_t) = (1 - \epsilon_t) \, Q_t(s_t, a_t) + \epsilon_t \big[ r(s_t, a_t) + \alpha \max_u Q_t(s_{t+1}, u) \big]
⋄ where t is the iteration index and \epsilon_t is the learning rate
⋄ in practice, this tabular form is unusable: |S| × |A| is huge; a minimal sketch of the update follows
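A minimal sketch of the tabular update, assuming discrete state and action indices and a Q table stored as a NumPy array; it needs only sampled transitions (s, a, r, s'), not P_{ss'}(a).

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, eps_t=0.1, alpha=0.95):
        # target = r(s_t, a_t) + alpha * max_u Q_t(s_{t+1}, u)
        target = r + alpha * np.max(Q[s_next])
        # blend the old estimate and the target with learning rate eps_t
        Q[s, a] = (1.0 - eps_t) * Q[s, a] + eps_t * target
        return Q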
Deep Q-Learning
1 Deep Q-learning:
⋄ use the output of a DNN parametrized by θ, i.e., f_θ(s, a), to approximate Q(s, a); a minimal sketch is given below
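A minimal sketch of the function-approximation idea in PyTorch: a small network outputs one Q-value per action and is trained to reduce the squared Bellman error on a sampled transition. The architecture and sizes are illustrative assumptions, not the paper's actual network; the two actions stand for annotate vs. skip.

    import torch
    import torch.nn as nn

    state_dim, n_actions, alpha = 32, 2, 0.95
    f_theta = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                            nn.Linear(64, n_actions))        # f_theta(s) -> Q(s, .)
    optimizer = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

    def dqn_update(s, a, r, s_next):
        # Bellman target r + alpha * max_u f_theta(s', u), held fixed during the update
        with torch.no_grad():
            target = r + alpha * f_theta(s_next).max()
        pred = f_theta(s)[a]                                  # f_theta(s, a)
        loss = (pred - target) ** 2                           # squared Bellman error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()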