Lecture 40 – Final Exam Review
Mark Hasegawa-Johnson
5/6/2020
Some sample problems
• DNNs: Practice Final, question 23
• Reinforcement learning: Practice Final, question 24
• Games: Practice Final, question 25
• Game theory: Practice Final, question 26
Practice Exam, question 23
You have a two-layer neural network trained as an animal classifier. The input feature vector is $\vec{y} = [y_1, y_2, y_3, 1]$, where $y_1$, $y_2$, and $y_3$ are some features, and the 1 is multiplied by the bias. There are two hidden nodes and three output nodes, $\vec{z}^* = [z_1^*, z_2^*, z_3^*]$, corresponding to the three output classes: $z_1^* = \Pr(\text{dog}|\vec{y})$, $z_2^* = \Pr(\text{cat}|\vec{y})$, $z_3^* = \Pr(\text{skunk}|\vec{y})$. Hidden node activations are sigmoid; output node activations are softmax.
[Figure: network diagram with inputs $y_1, y_2, y_3$ and a bias 1, weights $x_{ij}$, hidden nodes $h_1, h_2$, and outputs $z_1^*, z_2^*, z_3^*$. Puppy photo by http://www.birdphotos.com, own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=4409510]
Practice Exam, question 23
(a) A Maltese puppy has feature vector $\vec{y} = [2, 20, -1, 1]$. All weights and biases are initialized to zero. What is $\vec{z}^*$?
Practice Exam, question 23
(a) A Maltese puppy has feature vector $\vec{y} = [2, 20, -1, 1]$. All weights and biases are initialized to zero. What is $\vec{z}^*$?
Hidden node excitations are both:
$0 \times \vec{y} = 0$
Therefore, hidden node activations are both:
$\frac{1}{1 + e^{-0}} = \frac{1}{2}$
Practice Exam, question 23
(a) A Maltese puppy has feature vector $\vec{y} = [2, 20, -1, 1]$. All weights and biases are initialized to zero. What is $\vec{z}^*$?
Output node excitations are all:
$0 \times \vec{h} = 0$
Therefore, output node activations are all:
$\frac{e^0}{\sum_{k=1}^{3} e^0} = \frac{1}{3}$
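A quick numerical check of part (a): the sketch below (a minimal illustration in numpy; the weight-matrix shapes and variable names are my own, not part of the exam) runs the forward pass with all-zero weights and reproduces hidden activations of 1/2 and outputs of 1/3.

```python
import numpy as np

# Forward pass with all weights and biases initialized to zero (illustrative shapes).
y = np.array([2.0, 20.0, -1.0, 1.0])   # input features plus the constant 1 for the bias
W1 = np.zeros((2, 4))                   # hidden-layer weights (2 hidden nodes, 4 inputs)
W2 = np.zeros((3, 3))                   # output-layer weights (3 outputs, 2 hidden + bias)

g_hidden = W1 @ y                       # hidden excitations: all zero
h = 1.0 / (1.0 + np.exp(-g_hidden))     # sigmoid activations: both 1/2
h_aug = np.append(h, 1.0)               # append the hidden-layer bias input

g_out = W2 @ h_aug                      # output excitations: all zero
z = np.exp(g_out) / np.sum(np.exp(g_out))  # softmax: all 1/3
print(h, z)                             # [0.5 0.5] [0.3333 0.3333 0.3333]
```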
Practice Exam, question 23
(b) Let $x_{ij}$ be the weight connecting the ith output node to the jth hidden node. What is $\frac{\partial z_i^*}{\partial x_{ij}}$? Write your answer in terms of $z_i^*$, $x_{ij}$, and/or $h_j$ for appropriate values of i and/or j.
Practice Exam, question 23
(b) What is $\frac{\partial z_i^*}{\partial x_{ij}}$? Answer: OK, first we need the definition of softmax. Let's write it in lots of parts, so it will be easier to differentiate.
$z_i^* = \frac{\text{num}}{\text{den}}$
where "num" is the numerator of the softmax function:
$\text{num} = \exp(g_i)$
"den" is the denominator of the softmax function:
$\text{den} = \sum_{k=1}^{3} \exp(g_k)$
And both of those are written in terms of the softmax excitations, let's call them $g_k$:
$g_k = \sum_{j} x_{kj} h_j$
Practice Exam, question 23
(b) What is $\frac{\partial z_i^*}{\partial x_{ij}}$? Now we differentiate each part:
$\frac{\partial z_i^*}{\partial x_{ij}} = \frac{1}{\text{den}}\frac{\partial\,\text{num}}{\partial x_{ij}} - \frac{\text{num}}{\text{den}^2}\frac{\partial\,\text{den}}{\partial x_{ij}}$
$\frac{\partial\,\text{num}}{\partial x_{ij}} = \exp(g_i)\frac{\partial g_i}{\partial x_{ij}}$
$\frac{\partial\,\text{den}}{\partial x_{ij}} = \sum_{k=1}^{3} \exp(g_k)\frac{\partial g_k}{\partial x_{ij}}$
$\frac{\partial g_k}{\partial x_{ij}} = 0$ for $k \neq i$, and $\frac{\partial g_i}{\partial x_{ij}} = h_j$
Practice Exam, question 23
(b) What is $\frac{\partial z_i^*}{\partial x_{ij}}$? Putting it all back together again:
$\frac{\partial z_i^*}{\partial x_{ij}} = \frac{\exp(g_i)}{\sum_{k=1}^{3}\exp(g_k)}\, h_j - \left(\frac{\exp(g_i)}{\sum_{k=1}^{3}\exp(g_k)}\right)^2 h_j$
$\frac{\partial z_i^*}{\partial x_{ij}} = z_i^* h_j - z_i^* z_i^* h_j$
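The result $\partial z_i^*/\partial x_{ij} = z_i^* h_j - (z_i^*)^2 h_j = z_i^*(1 - z_i^*)h_j$ can be sanity-checked against a finite-difference estimate. A minimal sketch (assuming numpy; the particular weights and the helper function are illustrative, not part of the exam):

```python
import numpy as np

def softmax(g):
    e = np.exp(g - np.max(g))       # subtract the max for numerical stability
    return e / e.sum()

h = np.array([0.5, 0.5, 1.0])       # hidden activations plus the bias term
W2 = np.random.randn(3, 3)          # arbitrary output-layer weights
i, j = 1, 0                          # check dz_i*/dx_ij for one (i, j) pair

z = softmax(W2 @ h)
analytic = z[i] * (1.0 - z[i]) * h[j]

eps = 1e-6                           # finite-difference perturbation of one weight
W2p = W2.copy(); W2p[i, j] += eps
numeric = (softmax(W2p @ h)[i] - z[i]) / eps
print(analytic, numeric)             # the two values should agree to several decimal places
```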
Some sample problems
• DNNs: Practice Final, question 23
• Reinforcement learning: Practice Final, question 24
• Games: Practice Final, question 25
• Game theory: Practice Final, question 26
Practice Exam, question 24
A cat lives in a two-room apartment. It has two possible actions: purr, or walk. It starts in room s0 = 1, where it receives the reward r0 = 2 (petting). It then implements the following sequence of actions: a0 = walk, a1 = purr. In response, it observes the following sequence of states and rewards: s1 = 2, r1 = 5 (food), s2 = 2.
Practice Exam, question 24
(a) The cat starts out with a Q-table whose entries are all Q(s,a) = 0. It then:
• performs one iteration of TD-learning using each of the two SARS sequences described above,
• using a relatively high learning rate (alpha = 0.05) and a relatively low discount factor (gamma = 3/4).
Which entries in the Q-table have changed after this learning, and what are their new values?
Practice Exam, question 24
Time step 0: SARS = (1, walk, 2, 2)
$Q_{\text{local}} = R(1) + \gamma \max_a Q(2,a) = 2 + \tfrac{3}{4}\max(0,0) = 2$
$Q(1,\text{walk}) = Q(1,\text{walk}) + \alpha\left(Q_{\text{local}} - Q(1,\text{walk})\right) = 0 + 0.05 \times (2 - 0) = 0.1$
Time step 1: SARS = (2, purr, 5, 2)
$Q_{\text{local}} = R(2) + \gamma \max_a Q(2,a) = 5 + \tfrac{3}{4}\max(0,0) = 5$
$Q(2,\text{purr}) = Q(2,\text{purr}) + \alpha\left(Q_{\text{local}} - Q(2,\text{purr})\right) = 0 + 0.05 \times (5 - 0) = 0.25$
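The same two updates can be written as a few lines of code. This is a minimal sketch (the dictionary-keyed Q-table and variable names are my own illustration, not the exam's notation) of one TD-learning pass over the two SARS tuples:

```python
alpha, gamma = 0.05, 0.75
Q = {(s, a): 0.0 for s in (1, 2) for a in ("purr", "walk")}

# The two observed (s, a, r, s') tuples from the problem statement.
episode = [(1, "walk", 2, 2), (2, "purr", 5, 2)]

for s, a, r, s_next in episode:
    q_local = r + gamma * max(Q[(s_next, b)] for b in ("purr", "walk"))
    Q[(s, a)] += alpha * (q_local - Q[(s, a)])

print(Q[(1, "walk")], Q[(2, "purr")])   # 0.1 and 0.25; all other entries stay 0
```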
Practice Exam, question 24
(b) The cat decides, instead, to use model-based learning. Based on these two observations, it estimates P(s'|s,a) with Laplace smoothing, where the smoothing constant is k=1. Find P(s'|2,purr).
Time step 0: SARS = (1, walk, 2, 2)
Time step 1: SARS = (2, purr, 5, 2)
Practice Exam, question 24
(b) Find P(s'|2,purr).
$P(s'=1 \mid s=2, a=\text{purr}) = \frac{1 + \text{Count}(s=2, a=\text{purr}, s'=1)}{2 + \sum_{s'} \text{Count}(s=2, a=\text{purr}, s')} = \frac{1}{2+1}$
$P(s'=2 \mid s=2, a=\text{purr}) = \frac{1 + \text{Count}(s=2, a=\text{purr}, s'=2)}{2 + \sum_{s'} \text{Count}(s=2, a=\text{purr}, s')} = \frac{1+1}{2+1}$
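The Laplace-smoothed estimates can also be computed mechanically. A minimal sketch (the counting helper and names are illustrative, not part of the exam):

```python
from collections import Counter

k = 1                                    # Laplace smoothing constant
n_states = 2
counts = Counter()

# Observed transitions (s, a, s') from the two time steps.
for s, a, s_next in [(1, "walk", 2), (2, "purr", 2)]:
    counts[(s, a, s_next)] += 1

def p(s_next, s, a):
    num = k + counts[(s, a, s_next)]
    den = k * n_states + sum(counts[(s, a, sp)] for sp in (1, 2))
    return num / den

print(p(1, 2, "purr"), p(2, 2, "purr"))  # 1/3 and 2/3
```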
Practice Exam, question 24
(c) The cat estimates R(1)=2, R(2)=5, and the following P(s'|s,a) table. It chooses the policy pi(1)=purr, pi(2)=walk. What is the policy-dependent utility of each room? Write two equations in the two unknowns U(1) and U(2); don't solve.

         a=purr          a=walk
         s=1    s=2      s=1    s=2
s'=1     2/3    1/3      1/3    2/3
s'=2     1/3    2/3      2/3    1/3
Practice Exam, question 24
(c) Answer: policy-dependent utility is just like Bellman's equation, but without the max operation. The equations are
$U(1) = R(1) + \gamma \sum_{s'} P(s' \mid s=1, \pi(1))\, U(s')$
$U(2) = R(2) + \gamma \sum_{s'} P(s' \mid s=2, \pi(2))\, U(s')$

         a=purr          a=walk
         s=1    s=2      s=1    s=2
s'=1     2/3    1/3      1/3    2/3
s'=2     1/3    2/3      2/3    1/3
Practice Exam, question 24
(c) Answer: So to solve, we just plug in the values for all variables except U(1) and U(2):
$U(1) = 2 + \left(\tfrac{3}{4}\right)\left(\tfrac{2}{3} U(1) + \tfrac{1}{3} U(2)\right)$
$U(2) = 5 + \left(\tfrac{3}{4}\right)\left(\tfrac{2}{3} U(1) + \tfrac{1}{3} U(2)\right)$

         a=purr          a=walk
         s=1    s=2      s=1    s=2
s'=1     2/3    1/3      1/3    2/3
s'=2     1/3    2/3      2/3    1/3
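The exam only asks for the two equations, but as a sanity check they can be rewritten as a 2×2 linear system and solved numerically. A sketch (assuming numpy; solving is extra, not required by the question):

```python
import numpy as np

gamma = 0.75
R = np.array([2.0, 5.0])
# Transition matrix under the policy pi(1)=purr, pi(2)=walk:
# row s gives P(s'=1) and P(s'=2) when the policy's action is taken in state s.
P_pi = np.array([[2/3, 1/3],
                 [2/3, 1/3]])

# The two equations U = R + gamma * P_pi @ U, rearranged as (I - gamma * P_pi) U = R.
U = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
print(U)   # [11. 14.], i.e. U(1) = 11 and U(2) = 14
```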
Practice Exam, question 24
(d) Since it has some extra time, and excellent python programming skills, the cat decides to implement deep reinforcement learning, using an actor-critic algorithm. Inputs are one-hot encodings of state and action. What are the input and output dimensions of the actor network, and of the critic network?
Practice Exam, question 24
(d) The actor network computes $\pi_a(s)$ = probability that action a is the best action, where a=1 or a=2. So the output has two dimensions. The input is the state, s. If there are two states, encoded using a one-hot vector, then state 1 is encoded as $s = [1,0]$ and state 2 is encoded as $s = [0,1]$. So the input also has two dimensions.
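One concrete way to see these dimensions is to write the two networks down as single linear layers. This sketch is purely illustrative (numpy, made-up weights and names); in particular, the critic shown here assumes it scores a state-action pair, taking the concatenated one-hot state and one-hot action (4 inputs) and producing one scalar, which is how the question's phrasing reads, but the exam's official critic answer is not shown on this slide.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

n_states, n_actions = 2, 2

# Actor: input = one-hot state (2 dims), output = probability of each action (2 dims).
W_actor = np.random.randn(n_actions, n_states)
def actor(s):
    g = W_actor @ one_hot(s - 1, n_states)
    return np.exp(g) / np.exp(g).sum()           # softmax over the two actions

# Critic (assumption): input = one-hot state concatenated with one-hot action (4 dims),
# output = a single scalar value estimate.
W_critic = np.random.randn(1, n_states + n_actions)
def critic(s, a):
    x = np.concatenate([one_hot(s - 1, n_states), one_hot(a - 1, n_actions)])
    return (W_critic @ x).item()

print(actor(1), critic(1, 2))                    # a length-2 probability vector, one scalar
```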