Midterm Postmortem, CSE 473: Artificial Intelligence


  1. Midterm Postmortem
     CSE 473: Artificial Intelligence
     Reinforcement Learning
     Dan Weld, University of Washington
     - It was long, hard...
       - Max: 41
       - Min: 13
       - Mean & Median: 27
     - Final
       - Will include some of the midterm problems
     [Most of these slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

     Office Hour Change (this week)
     - Thurs 10-11am
     - CSE 588
     - (Not Fri)
     [Cartoon: "Listen Simkins, when I said that you could always come to me with your problems, I meant during office hours!"]

     Reinforcement Learning

     Two Key Ideas
     - Credit assignment problem
     - Exploration-exploitation tradeoff

     Reinforcement Learning
     [Diagram: Agent takes Actions a; Environment returns State s and Reward r]
     - Basic idea:
       - Receive feedback in the form of rewards
       - Agent's utility is defined by the reward function
       - Must (learn to) act so as to maximize expected rewards
       - All learning is based on observed samples of outcomes! (see the interaction-loop sketch below)
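To make the agent-environment loop on this slide concrete, here is a minimal sketch of the interaction loop in Python. The env and agent objects and their methods (reset, step, choose_action, update) are hypothetical placeholders, not part of the course code; the point is only that the learner acts, observes a reward and next state, and learns from those observed samples.

    # Minimal sketch of the RL interaction loop described on the slide.
    # `env` and `agent` are hypothetical objects with the interfaces shown.

    def run_episode(env, agent):
        """Run one episode; the agent learns only from observed samples."""
        s = env.reset()                      # observe the initial state
        done = False
        total_reward = 0.0
        while not done:
            a = agent.choose_action(s)       # act so as to maximize expected reward
            s_next, r, done = env.step(a)    # environment returns reward and next state
            agent.update(s, a, r, s_next)    # all learning is based on observed samples
            total_reward += r
            s = s_next
        return total_reward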

  2. The "Credit Assignment" Problem
     I'm in state 43, reward = 0, action = 2
     I'm in state 39, reward = 0, action = 4
     I'm in state 22, reward = 0, action = 1
     I'm in state 21, reward = 0, action = 1
     I'm in state 21, reward = 0, action = 1
     I'm in state 13, reward = 0, action = 2
     I'm in state 54, reward = 0, action = 2
     I'm in state 26, reward = 100
     Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.

  3. Exploration-Exploitation Tradeoff
     - You have visited part of the state space and found a reward of 100
       - Is this the best you can hope for???
     - Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
       - At risk of missing out on a better reward somewhere
     - Exploration: should I look for states w/ more reward? (one standard recipe, epsilon-greedy, is sketched after this slide group)
       - At risk of wasting time & getting some negative reward

     Example: Animal Learning
     - RL studied experimentally for more than 60 years in psychology
       - Rewards: food, pain, hunger, drugs, etc.
       - Mechanisms and sophistication debated
     - Example: foraging
       - Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
       - Bees have a direct neural connection from nectar intake measurement to the motor planning area

     Demos
     - http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html

     Example: Backgammon
     - Reward only for win / loss in terminal states, zero otherwise
     - TD-Gammon learns a function approximation to V(s) using a neural network
     - Combined with depth-3 search, one of the top 3 players in the world
     - You could imagine training Pacman this way...
       - ...but it's tricky! (It's also P3)
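The exploration-exploitation slide above poses the tradeoff without fixing a rule. One standard recipe (not prescribed by the slides) is epsilon-greedy action selection: explore with a small probability epsilon, otherwise exploit the action that currently looks best. A minimal sketch, assuming some current value estimates q_values for the legal actions (both hypothetical names):

    import random

    def epsilon_greedy(q_values, actions, epsilon=0.1):
        """Pick an action for the current state.

        q_values: dict mapping action -> current estimate of its value
                  (hypothetical; any value estimates would do).
        actions:  list of legal actions in the current state.
        epsilon:  probability of exploring instead of exploiting.
        """
        if random.random() < epsilon:
            return random.choice(actions)                        # explore: try something new
        return max(actions, key=lambda a: q_values.get(a, 0.0))  # exploit: best known action

With epsilon = 0.1 the agent usually takes the best-looking action, but still samples the others often enough to discover better rewards elsewhere.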

  4. Example: Learning to Walk
     [Kohl and Stone, ICRA 2004]
     - Initial [Video: AIBO WALK - initial]
     - A Learning Trial
     - After Learning [1K Trials]
     - Finished [Video: AIBO WALK - finish]

     Example: Sidewinding
     [Andrew Ng] [Video: SNAKE - climbStep+sidewin...]

     The Crawler! / Video of Demo Crawler Bot
     [Demo: Crawler Bot (L10D1)] [You, in Proj...]

  5. Other Applications
     - Robotic control: helicopter maneuvering, autonomous vehicles
     - Mars rover: path planning, oversubscription planning
     - Elevator planning
     - Game playing: backgammon, tetris, checkers
     - Neuroscience
     - Computational finance, sequential auctions
     - Assisting the elderly in simple tasks
     - Spoken dialog management
     - Communication networks: switching, routing, flow control
     - War planning, evacuation planning

     Reinforcement Learning
     - Still assume a Markov decision process (MDP):
       - A set of states s ∈ S
       - A set of actions (per state) A
       - A model T(s,a,s')
       - A reward function R(s,a,s') & discount γ
     - Still looking for a policy π(s)
     - New twist: don't know T or R
       - i.e., we don't know which states are good or what the actions do
       - Must actually try actions and states out to learn

     Overview
     - Offline planning (MDPs)
       - Value iteration, policy iteration
     - Online: reinforcement learning
       - Model-based
       - Model-free
       - Passive
       - Active

     Offline (MDPs) vs. Online (RL)
     - Offline: solution
     - Online: learning

     Passive Reinforcement Learning
     - Simplified task: policy evaluation
       - Input: a fixed policy π(s)
       - You don't know the transitions T(s,a,s')
       - You don't know the rewards R(s,a,s')
       - Goal: learn the state values
     - In this case:
       - Learner is "along for the ride"
       - No choice about what actions to take
       - Just execute the policy and learn from experience (see the episode-collection sketch below)
       - This is NOT offline planning! You actually take actions in the world.
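In passive RL the learner simply runs the fixed policy π and records what happens; T and R are never consulted, only sampled. A rough sketch of that data collection, assuming the same hypothetical env interface as before and a policy given as a state-to-action mapping pi:

    # Sketch of passive RL data collection: follow a fixed policy pi and
    # record the experienced transitions. `env` and `pi` are hypothetical.

    def collect_episode(env, pi):
        """Run the fixed policy once; return the observed transitions."""
        episode = []
        s = env.reset()
        done = False
        while not done:
            a = pi[s]                        # no choice: just follow the given policy
            s_next, r, done = env.step(a)    # T and R are unknown; we only see samples
            episode.append((s, a, s_next, r))  # same ordering as the slide's "B, east, C, -1"
            s = s_next
        return episode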

  6. Model-Based Learning
     - Model-based idea:
       - Learn an approximate model based on experiences
       - Solve for values as if the learned model were correct
     - Step 1: Learn the empirical MDP model (see the counting sketch after this slide group)
       - Count outcomes s' for each s, a
       - Normalize to give an estimate of T(s,a,s')
       - Discover each R(s,a,s') when we experience (s, a, s')
     - Step 2: Solve the learned MDP
       - For example, use value iteration, as before

     Example: Model-Based Learning
     Input policy π (grid with states A, B, C, D, E); assume γ = 1.
     Observed episodes (training):
       Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
       Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
       Episode 3: E, north, C, -1; C, east, D, -1;  D, exit, x, +10
       Episode 4: E, north, C, -1; C, east, A, -1;  A, exit, x, -10
     Learned model:
       T(s,a,s'): T(B, east, C) = 1.00;  T(C, east, D) = 0.75;  T(C, east, A) = 0.25;  ...
       R(s,a,s'): R(B, east, C) = -1;  R(C, east, D) = -1;  R(D, exit, x) = +10;  ...

     Simple Example: Expected Age
     Goal: compute the expected age of CSE 473 students.
     - Known P(A): compute the expectation directly from the distribution.
     - Unknown P(A): instead collect samples [a1, a2, ..., aN]
       - "Model-based": estimate P(A) from the sample counts, then compute the expectation from the estimated distribution. Why does this work? Because eventually you learn the right model.
       - "Model-free": average the samples themselves. Why does this work? Because samples appear with the right frequencies.

     Direct Evaluation
     - Goal: compute values for each state under π
     - Idea: average together observed sample values
       - Act according to π
       - Every time you visit a state, write down what the sum of discounted rewards turned out to be
       - Average those samples
     - This is called direct evaluation (see the return-averaging sketch below)
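As referenced in Step 1 of the model-based slide, the empirical MDP is just normalized counts plus observed rewards. The sketch below shows one way to write that step (function and variable names are mine, not the course's), applied to the four observed episodes from the example; running it reproduces T(C, east, D) = 0.75 and T(C, east, A) = 0.25. Step 2 would then run value iteration on the estimated T and R exactly as in the MDP lectures.

    from collections import defaultdict

    def learn_empirical_mdp(episodes):
        """Step 1: count outcomes s' for each (s, a), then normalize.

        episodes: list of episodes, each a list of (s, a, s_next, r) tuples.
        Returns (T_hat, R_hat): estimated transition probabilities and rewards.
        (Illustrative sketch, not the course's reference implementation.)
        """
        counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
        R_hat = {}                                       # reward discovered when (s, a, s') is experienced
        for episode in episodes:
            for (s, a, s_next, r) in episode:
                counts[(s, a)][s_next] += 1
                R_hat[(s, a, s_next)] = r

        T_hat = {}
        for (s, a), outcomes in counts.items():
            total = sum(outcomes.values())
            for s_next, n in outcomes.items():
                T_hat[(s, a, s_next)] = n / total        # normalize counts to probabilities
        return T_hat, R_hat

    # The four observed episodes from the slide (terminal marked 'x'):
    episodes = [
        [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
        [('B', 'east', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
        [('E', 'north', 'C', -1), ('C', 'east', 'D', -1), ('D', 'exit', 'x', +10)],
        [('E', 'north', 'C', -1), ('C', 'east', 'A', -1), ('A', 'exit', 'x', -10)],
    ]
    T_hat, R_hat = learn_empirical_mdp(episodes)
    print(T_hat[('C', 'east', 'D')])   # 0.75
    print(T_hat[('C', 'east', 'A')])   # 0.25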

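Direct evaluation is the "model-free" column of the expected-age slide applied to returns: no model is estimated, the observed discounted returns from each visited state are simply averaged. A minimal sketch, using the same hypothetical (s, a, s', r) episode format as the collection sketch above:

    from collections import defaultdict

    def direct_evaluation(episodes, gamma=1.0):
        """Estimate V^pi(s) by averaging the observed returns from each visit to s.

        episodes: list of episodes, each a list of (s, a, s_next, r) tuples,
                  generated by following the fixed policy pi.
        (Illustrative sketch, not the course's reference implementation.)
        """
        returns = defaultdict(list)          # returns[s] = sampled returns starting at s
        for episode in episodes:
            G = 0.0
            # Walk backwards so G is the discounted sum of rewards from each visit onward.
            for (s, a, s_next, r) in reversed(episode):
                G = r + gamma * G
                returns[s].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}

For example, with the four episodes above and γ = 1, the sampled returns from state C are +9, +9, +9, and -11, so direct evaluation estimates V(C) = (9 + 9 + 9 - 11) / 4 = +4.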