10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Deep RL + K-Means
Matt Gormley
Lecture 25
Apr. 17, 2019 1
Reminders
• Homework 8: Reinforcement Learning
  – Out: Wed, Apr 10
  – Due: Wed, Apr 24 at 11:59pm
• Today's In-Class Poll
  – http://p25.mlcourse.org 2
Q&A
Q: Do we have to retrain our RL agent every time we change our state space?
A: Yes. But whether your state space changes from one setting to another is determined by your design of the state representation. Two examples:
  – State Space A: <x, y> position on the map, e.g. s_t = <74, 152>
  – State Space B: window of pixel colors centered at the current Pac-Man location, e.g. s_t = [0 1 0; 0 0 0; 1 1 1] 3
DEEP RL EXAMPLES 6
TD-Gammon → AlphaGo
Learning to beat the masters at board games
THEN: "…the world's top computer program for backgammon, TD-GAMMON (Tesauro, 1992, 1995), learned its strategy by playing over one million practice games against itself…" (Mitchell, 1997)
NOW: AlphaGo 7
Playing Atari with Deep RL
• Setup: the RL system observes the pixels on the screen (observation O_t)
• It receives rewards (R_t) as the game score
• Actions (A_t) decide how to move the joystick / buttons 8
Figures from David Silver (Intro RL lecture)
Playing Atari with Deep RL
Figure 1: Screen shots from five Atari 2600 games: (left to right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider
Videos:
  – Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk
  – Space Invaders: https://www.youtube.com/watch?v=ePv0Fs9cGgU 9
Figures from Mnih et al. (2013)
Playing Atari with Deep RL
Figure 1: Screen shots from five Atari 2600 games: (left to right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

                  B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
Random                 354       1.2       0  -20.4     157       110          179
Sarsa [3]              996       5.2     129    -19     614       665          271
Contingency [4]       1743         6     159    -17     960       723          268
DQN                   4092       168     470     20    1952      1705          581
Human                 7456        31     368     -3   18900     28010         3690

HNeat Best [8]        3616        52     106     19    1800       920         1720
HNeat Pixel [8]       1332         4      91    -16    1325       800         1145
DQN Best              5184       225     661     21    4500      1740         1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score, while DQN used an ε-greedy policy with ε = 0.05. 10
Figures from Mnih et al. (2013)
Deep Q-Learning
Question: What if our state space S is too large to represent with a table?
Examples:
• s_t = pixels of a video game
• s_t = continuous values of the sensors in a manufacturing robot
• s_t = sensor output from a self-driving car
Answer: Use a parametric function to approximate the table entries.
Key Idea:
1. Use a neural network Q(s, a; θ) to approximate Q*(s, a)
2. Learn the parameters θ via SGD with training examples <s_t, a_t, r_t, s_{t+1}> 11
Deep Q-Learning
Whiteboard
  – Strawman loss function (i.e. what we cannot compute)
  – Approximating the Q function with a neural network
  – Approximating the Q function with a linear model
  – Deep Q-Learning
  – Function approximators (<state, action_i> → q-value vs. state → all action q-values) 12
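The whiteboard derivation above isn't reproduced in the slides, so here is a minimal sketch of the update those bullet points describe, using the linear-model variant so the gradient can be read off by hand; the state/action dimensions, learning rate, and discount factor are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def q_values(theta, s):
    """Linear function approximator: one weight vector per action.
    theta has shape (num_actions, state_dim); returns q(s, a; theta) for every a."""
    return theta @ s

def q_learning_update(theta, s, a, r, s_next, done, gamma=0.9, lr=0.01):
    """One SGD step on the squared TD error
    (r + gamma * max_a' q(s', a'; theta) - q(s, a; theta))^2,
    treating the bootstrapped target as a constant, as in Q-Learning / DQN."""
    target = r if done else r + gamma * np.max(q_values(theta, s_next))
    td_error = target - q_values(theta, s)[a]
    theta[a] += lr * td_error * s   # gradient of q(s, a; theta) w.r.t. theta[a] is just s
    return theta

# Toy usage with made-up shapes: 4 actions, 8-dimensional state features
theta = np.zeros((4, 8))
s, s_next = np.random.rand(8), np.random.rand(8)
theta = q_learning_update(theta, s, a=2, r=1.0, s_next=s_next, done=False)
```

Swapping the linear model for a neural network Q(s, a; θ) gives Deep Q-Learning; each update then amounts to a regression step toward the bootstrapped target r + γ max_a′ Q(s′, a′; θ).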
Experience Replay
• Problems with online updates for Deep Q-Learning:
  – not i.i.d. as SGD would assume
  – quickly forget rare experiences that might later be useful to learn from
• Uniform Experience Replay (Lin, 1992):
  – Keep a replay memory D = {e_1, e_2, …, e_N} of the N most recent experiences e_t = <s_t, a_t, r_t, s_{t+1}>
  – Alternate two steps:
    1. Repeat T times: randomly sample e_i from D and apply a Q-Learning update to e_i
    2. Agent selects an action using an ε-greedy policy to receive a new experience that is added to D
• Prioritized Experience Replay (Schaul et al., 2016)
  – similar to Uniform ER, but sample so as to prioritize experiences with high error 13
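A minimal sketch of the uniform replay memory described above; the buffer capacity, batch size, and the surrounding training loop are assumptions for illustration rather than details from the slide.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the N most recent experiences e_t = <s_t, a_t, r_t, s_{t+1}, done>."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform ER: every stored experience is equally likely to be replayed,
        # which breaks the temporal correlations that violate SGD's i.i.d. assumption.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Prioritized ER keeps the same interface but samples each experience with probability that increases with its most recent TD error instead of uniformly.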
Alpha Go
Game of Go
• 19x19 board
• Players alternately play black/white stones
• Goal is to fully encircle the largest region on the board
• Simple rules, but extremely complex game play
[Figure: Game 1, Fan Hui (Black) vs. AlphaGo (White); AlphaGo wins by 2.5 points] 14
Figure from Silver et al. (2016)
Alpha Go
• State space is too large to represent explicitly since the # of sequences of moves is O(b^d)
  – Go: b=250 and d=150
  – Chess: b=35 and d=80
• Key idea:
  – Define a neural network to approximate the value function
  – Train by policy gradient
[Figure 1 from Silver et al. (2016), neural network training pipeline and architecture: a rollout policy p_π and SL policy network p_σ trained by classification on human expert positions; an RL policy network p_ρ trained by policy gradient through self-play; and a value network v_θ(s′) trained by regression to predict whether the current player wins in positions from the self-play data set] 15
Figure from Silver et al. (2016)
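This is not AlphaGo's actual training code; it is only a hedged sketch of the policy-gradient idea from the second bullet, using a softmax policy over linear features so that ∇ log π(a|s) has a closed form. The feature and action dimensions and the learning rate are made up, and the return G stands in for the self-play game outcome (e.g. +1 win / −1 loss).

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(a | s) proportional to exp(theta[a] . s)."""
    prefs = theta @ s
    prefs -= prefs.max()             # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def policy_gradient_update(theta, episode, lr=0.01):
    """REINFORCE-style step: move theta along G * grad log pi(a | s)
    for every (state, action, return) triple visited in the episode."""
    for s, a, G in episode:
        pi = softmax_policy(theta, s)
        grad_log = -np.outer(pi, s)  # row b gets -pi(b) * s ...
        grad_log[a] += s             # ... plus s for the action actually taken
        theta += lr * G * grad_log
    return theta

theta = np.zeros((3, 5))             # 3 actions, 5 features (made-up sizes)
episode = [(np.random.rand(5), 1, +1.0), (np.random.rand(5), 0, +1.0)]
theta = policy_gradient_update(theta, episode)
```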
Alpha Go
• Results of a tournament
• From Silver et al. (2016): "a 230 point gap corresponds to a 79% probability of winning"
[Figure: Elo ratings of distributed AlphaGo, AlphaGo, Fan Hui, Crazy Stone, Zen, Pachi, Fuego, and GnuGo, shown against the professional dan (p), amateur dan (d), and beginner kyu (k) scales] 16
Figure from Silver et al. (2016)
Learning Objectives
Reinforcement Learning: Q-Learning
You should be able to…
1. Apply Q-Learning to a real-world environment
2. Implement Q-Learning
3. Identify the conditions under which the Q-Learning algorithm will converge to the true value function
4. Adapt Q-Learning to Deep Q-Learning by employing a neural network approximation to the Q function
5. Describe the connection between Deep Q-Learning and regression 17
Q-Learning
Question: For the R(s,a) values shown on the arrows in the figure below, which are the corresponding Q*(s,a) values? Assume a discount factor of 0.5.
Answer:
[Figure: MDP diagram with reward values R(s,a) labeling the arrows, alongside candidate Q*(s,a) answer diagrams] 19
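Since the exercise's diagram does not survive the slide export, here is a hedged worked example on a made-up three-state deterministic MDP (not the one pictured) showing how Q*(s,a) follows from R(s,a) and the discount factor 0.5 via the Bellman optimality equation.

```python
# Hypothetical deterministic MDP (NOT the slide's figure): rewards and transitions.
R   = {0: {'right': 0, 'down': 8}, 1: {'right': 16}, 2: {}}   # R[s][a]
NXT = {0: {'right': 1, 'down': 2}, 1: {'right': 2}, 2: {}}    # next state; state 2 is terminal
gamma = 0.5

Q = {s: {a: 0.0 for a in R[s]} for s in R}
for _ in range(50):      # iterate Q(s,a) = R(s,a) + gamma * max_a' Q(s',a') to a fixed point
    for s in R:
        for a in R[s]:
            s2 = NXT[s][a]
            best_next = max(Q[s2].values()) if Q[s2] else 0.0
            Q[s][a] = R[s][a] + gamma * best_next

print(Q)   # Q*(1,'right') = 16, Q*(0,'right') = 0 + 0.5*16 = 8, Q*(0,'down') = 8 + 0.5*0 = 8
```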
K-MEANS 20
K-Means Outline
• Clustering: Motivation / Applications
• Optimization Background
  – Coordinate Descent
  – Block Coordinate Descent
• Clustering
  – Inputs and Outputs
  – Objective-based Clustering
• K-Means
  – K-Means Objective
  – Computational Complexity
  – K-Means Algorithm / Lloyd's Method
• K-Means Initialization
  – Random
  – Farthest Point
  – K-Means++ 21
Clustering, Informal Goals
Goal: Automatically partition unlabeled data into groups of similar datapoints.
Question: When and why would we want to do this?
Useful for:
• Automatically organizing data.
• Understanding hidden structure in data.
• Preprocessing for further analysis.
• Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
Slide courtesy of Nina Balcan
Applications (Clustering comes up everywhere…)
• Cluster news articles, web pages, or search results by topic.
• Cluster protein sequences by function, or genes according to expression profile.
• Cluster users of social networks by interest (community detection).
[Figures: Twitter network, Facebook network]
Slide courtesy of Nina Balcan
Applications (Clustering comes up everywhere…)
• Cluster customers according to purchase history.
• Cluster galaxies or nearby stars (e.g. Sloan Digital Sky Survey).
• And many many more applications….
Slide courtesy of Nina Balcan
Optimization Background Whiteboard: – Coordinate Descent – Block Coordinate Descent 25
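The whiteboard itself isn't captured in the deck, so below is a minimal sketch of (block) coordinate descent on a made-up two-variable quadratic; the closed-form per-coordinate minimizers are specific to that example objective.

```python
import numpy as np

def coordinate_descent(argmin_coord, x0, num_passes=25):
    """Repeatedly minimize the objective over one coordinate (block) at a time,
    holding all of the other coordinates fixed."""
    x = np.array(x0, dtype=float)
    for _ in range(num_passes):
        for i in range(len(x)):
            x[i] = argmin_coord(i, x)        # exact minimization along coordinate i
    return x

# Example objective: f(x) = (x0 - 1)^2 + (x1 + 2)^2 + (x0 - x1)^2
def argmin_coord(i, x):
    if i == 0:
        return (1 + x[1]) / 2.0              # solve df/dx0 = 0 with x1 held fixed
    return (x[0] - 2) / 2.0                  # solve df/dx1 = 0 with x0 held fixed

print(coordinate_descent(argmin_coord, x0=[0.0, 0.0]))   # converges to (0, -1)
```

K-Means (coming up) follows exactly this pattern with two blocks: the cluster assignments and the cluster centers.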
Clustering Question: Which of these partitions is “better”? 26
K-Means Whiteboard: – Clustering: Inputs and Outputs – Objective-based Clustering – K-Means Objective – Computational Complexity – K-Means Algorithm / Lloyd’s Method 27
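Ahead of the whiteboard, a minimal NumPy sketch of Lloyd's method; the random initialization, empty-cluster handling, and convergence test are implementation assumptions rather than the only options.

```python
import numpy as np

def kmeans(X, k, num_iters=100, seed=0):
    """Block coordinate descent on the K-Means objective
    sum_i || x_i - mu_{z_i} ||^2 over assignments z and centers mu."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # random initialization
    for _ in range(num_iters):
        # Block 1: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Block 2: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break                                                # assignments/centers stabilized
        centers = new_centers
    return centers, z

centers, z = kmeans(np.random.rand(200, 2), k=3)
```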
K-Means Initialization Whiteboard: – Random – Furthest Traversal – K-Means++ 28
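A hedged sketch of the K-Means++ seeding rule (Arthur & Vassilvitskii, 2007) listed above; as noted in the comment, furthest-point traversal is its deterministic counterpart.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick the first center uniformly at random; pick each subsequent center with
    probability proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # Furthest-point traversal would deterministically take d2.argmax() instead.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

init_centers = kmeans_pp_init(np.random.rand(200, 2), k=3)
```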