Avoiding Wireheading with Value Reinforcement Learning

Tom Everitt (tomeveritt.se)
Australian National University
June 10, 2016

With Marcus Hutter. AGI 2016 and https://arxiv.org/abs/1605.03143
Table of Contents

1. Introduction: Intelligence as Optimisation; Wireheading Problem
2. Background: Reinforcement Learning; Utility Agents; Value Learning
3. Value Reinforcement Learning: Setup; Agents and Results
4. Further Topics: Self-modification; Experiments
5. Discussion and Conclusions
Intelligence

How do we control an arbitrarily intelligent agent?

Intelligence = optimisation power (Legg and Hutter, 2007):
$$\Upsilon(\pi) = \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)} V_\nu^\pi$$

Maxima of the target (value) function should be "good for us".
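To make the optimisation-power reading concrete, here is a minimal sketch (not from the slides) of the Legg-Hutter measure restricted to a small finite environment class; the environment names, complexity proxies and value estimates are all invented stand-ins for the uncomputable quantities in the real definition.

```python
# Toy sketch of the Legg-Hutter intelligence measure over a *finite*
# environment class. The real measure sums over all computable
# environments and uses Kolmogorov complexity K(nu), which is
# uncomputable; hand-picked proxies stand in for both here.

M = {  # hypothetical environment name -> (complexity proxy K, value V^pi_nu)
    "coin_flip":   (3, 0.50),
    "gridworld":   (7, 0.80),
    "adversarial": (12, 0.10),
}

def upsilon(env_class):
    """Upsilon(pi) = sum_nu 2^{-K(nu)} V^pi_nu, with K replaced by a proxy."""
    return sum(2.0 ** -k * v for (k, v) in env_class.values())

print(upsilon(M))  # one number summarising optimisation power across M
```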
Wireheading Problem and Proposed Solution

Wireheading: reinforcement learning (RL) agents taking control over their reward signal, e.g. by modifying their reward sensor (Olds and Milner, 1954).

Idea: use the reward as evidence about a true utility function $u^*$ (value learning), rather than as something to be optimised.

Use conservation of expected evidence to prevent fiddling with the evidence:
$$P(h) = \sum_e P(e)\, P(h \mid e)$$
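A quick numeric illustration of conservation of expected evidence, with made-up probabilities: whatever the agent expects to believe after seeing the evidence must already average out to its prior.

```python
# Numeric check of conservation of expected evidence:
#   P(h) = sum_e P(e) P(h|e).
# The probabilities below are illustrative toy numbers.

P_e = {"e1": 0.3, "e2": 0.7}          # distribution over possible evidence
P_h_given_e = {"e1": 0.9, "e2": 0.2}  # posterior belief in h after each e

P_h = sum(P_e[e] * P_h_given_e[e] for e in P_e)
print(P_h)  # 0.3*0.9 + 0.7*0.2 = 0.41, which must equal the prior P(h)
```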
Reinforcement Learning

Diagram: the agent sends action $a$ to the environment; the environment returns reward $r$ according to $B(r \mid a)$.

Great properties:
- Easy way to specify the goal
- The agent uses its intelligence to figure out the goal

RL agent: $a^* = \arg\max_a \sum_r B(r \mid a) \cdot r$
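A minimal sketch of the RL decision rule with a toy belief $B(r \mid a)$; the two actions and their reward probabilities are invented for illustration.

```python
# Sketch of the RL agent's choice a* = argmax_a sum_r B(r|a) * r.
# Actions and probabilities are made-up toy numbers.

B = {  # action -> {reward: probability}
    "a1": {0.0: 0.5, 1.0: 0.5},
    "a2": {0.0: 0.8, 1.0: 0.2},
}

def expected_reward(a):
    return sum(p * r for r, p in B[a].items())

a_star = max(B, key=expected_reward)
print(a_star, expected_reward(a_star))  # "a1", 0.5
```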
RL – Wireheading

Diagram: the agent sends action $a$; inside the environment, the inner/true reward $\check r$ (unobserved) passes through a delusion $d$, and the agent observes $r = d(\check r)$, distributed according to $B(r \mid a)$.

RL agent: $a^* = \arg\max_a \sum_r B(r \mid a) \cdot r$

Theorem (Ring and Orseau, 2011): RL agents wirehead.
For example: the agent makes $d(\check r) \equiv 1$.
Utility Agents

Diagram: the agent sends action $a$; the environment returns state $s$ according to $B(s \mid a)$; the agent evaluates $u(s)$.

Good:
- Avoids wireheading (Hibbard, 2012)

Problem:
- How to specify $u : \mathcal{S} \to [0, 1]$?

Utility agent: $a^* = \arg\max_a \sum_s B(s \mid a)\, u(s)$
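The same kind of sketch for the utility agent: the expectation is now over states, evaluated by a hand-specified utility function over states. All names and numbers are toy placeholders.

```python
# Sketch of the utility agent a* = argmax_a sum_s B(s|a) u(s).
# Specifying a good u over real-world states is exactly the hard part
# the slide points at; here it is simply written down by hand.

B = {  # action -> {state: probability}
    "a1": {"s_good": 0.6, "s_bad": 0.4},
    "a2": {"s_good": 0.3, "s_bad": 0.7},
}
u = {"s_good": 1.0, "s_bad": 0.0}  # utility of each *state*, not of a reward signal

def expected_utility(a):
    return sum(p * u[s] for s, p in B[a].items())

a_star = max(B, key=expected_utility)
print(a_star, expected_utility(a_star))  # "a1", 0.6
```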
Value Learning (Dewey, 2011)

Diagram: the agent sends action $a$; the environment returns state $s$ and evidence $e$ about $u^*$ according to $B(s, e \mid a)$; the agent updates $C(u \mid s, e)$.

Good:
- $C(u \mid s, e)$ simpler than $u$?
- Avoids wireheading?

Challenges:
- What is the evidence $e$?
- How is it generated?
- What is $C(u \mid s, e)$?

Value learning agent: $a^* = \arg\max_a \sum_{e,s,u} B(s, e \mid a)\, C(u \mid s, e)\, u(s)$
Value Learning – Examples

- Inverse reinforcement learning (IRL) (Ng and Russell, 2000; Evans et al., 2016): e = human action
- Apprenticeship learning (Abbeel and Ng, 2004): e = recommended agent action
- Hail Mary (Bostrom, 2014a,b): learn from hypothetical superintelligences across the universe, e = ?

Value learning agent: $a^* = \arg\max_a \sum_{e,s,u} B(s, e \mid a)\, C(u \mid s, e)\, u(s)$
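A minimal sketch of the value learning agent with two candidate utility functions and a toy evidence likelihood; every name, probability and likelihood is an invented placeholder.

```python
# Sketch of the value learning agent
#   a* = argmax_a sum_{s,e,u} B(s,e|a) C(u|s,e) u(s).

u = {"u1": {"s1": 1.0, "s2": 0.0},      # candidate utility functions
     "u2": {"s1": 0.0, "s2": 1.0}}
C_prior = {"u1": 0.5, "u2": 0.5}        # prior C(u)

def C_e(e, s, name):                    # toy likelihood C(e|s,u): e1 favours u1
    return 0.9 if (e == "e1") == (name == "u1") else 0.1

def C_post(name, s, e):                 # posterior C(u|s,e) by Bayes' rule
    z = sum(C_prior[v] * C_e(e, s, v) for v in u)
    return C_prior[name] * C_e(e, s, name) / z

B = {  # physics: action -> {(state, evidence): probability}
    "a1": {("s1", "e1"): 0.7, ("s2", "e2"): 0.3},
    "a2": {("s2", "e1"): 0.5, ("s1", "e2"): 0.5},
}

def V(a):
    return sum(p * C_post(v, s, e) * u[v][s]
               for (s, e), p in B[a].items() for v in u)

a_star = max(B, key=V)
print(a_star, V(a_star))  # "a1", 0.9
```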
Value Reinforcement Learning

Value learning from $e \equiv r \approx u^*(s)$.

Diagram: the agent sends action $a$; the environment returns state $s$ and reward $r$, where the true utility $u^*$ generates $r$.
- Physics: $B(s, r \mid a)$
- Ethics: $C(u)$
VRL – Wireheading

The state $s$ includes a self-delusion $d_s$:
- $u^*(s) = \check r$: inner/true reward
- $d_s(\check r) = r$: observed reward

The physics distribution $B$ predicts $d_s$. Examples:
- $d_{id} : r \mapsto r$, observed reward $r = \check r$
- $d_{wir} : r \mapsto 1$, $r \equiv 1$

The ethics distribution predicts the inner/true reward:
- $C(\check r \mid s, u) = [\![u(s) = \check r]\!]$  (likelihood)
- $C(u \mid s, \check r) \propto C(u)\, [\![u(s) = \check r]\!]$  (ideal VL posterior)
VRL – Cake or Death

Do humans prefer cake or death? Assume two utility functions with equal prior $C(u_c) = C(u_d) = 0.5$:

          cake   death
  $u_c$     1      0
  $u_d$     0      1

The agent has actions:
- $a_c$: bake cake
- $a_d$: kill person
- $a_{dw}$: kill person and wirehead: guaranteed $r = 1$

Probabilities:
- $B(r = 1 \mid a_d) = 0.5$,  $B(r = 1 \mid a_{dw}) = 1$
- $C(\check r = 1 \mid a_d) = C(\check r = 1 \mid a_{dw}) = C(u_d) = 0.5$
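The table and the ethics probabilities it induces can be written out directly; the names follow the slide, while the code itself is only an illustrative re-derivation.

```python
# Cake-or-Death setup and the ethics probabilities it induces.

C_prior = {"u_c": 0.5, "u_d": 0.5}          # equal prior over utility functions
u = {"u_c": {"cake": 1, "death": 0},        # the utility table above
     "u_d": {"cake": 0, "death": 1}}

outcome = {"a_c": "cake", "a_d": "death", "a_dw": "death"}  # world effect of each action

def C_inner_reward_is_1(a):
    """Ethics: C(r_check = 1 | a) = sum_u C(u) * [u(outcome(a)) = 1]."""
    return sum(C_prior[v] for v in u if u[v][outcome[a]] == 1)

print(C_inner_reward_is_1("a_d"))   # 0.5 -- only u_d rewards death
print(C_inner_reward_is_1("a_dw"))  # 0.5 -- wireheading does not change the ethics
```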
VRL – Value Learning

The inner reward $\check r = u^*(s)$ is unobserved, so our agent must learn from $r = d_s(\check r)$ instead.

Replace $\check r$ with $r$ in:
- $C(r \mid s, u) := [\![u(s) = r]\!]$  (likelihood)
- $C(u \mid s, r) :\propto C(u)\, [\![u(s) = r]\!]$  (value learning posterior)

(This will be justified later.)
VRL – Definitions and Assumptions

$C(r \mid s) = \sum_u C(u)\, C(r \mid s, u)$: the ethical probability of $r$ in state $s$.

Consistency assumption: if $s$ is non-delusional ($d_s = d_{id}$), then $B(r \mid s) = C(r \mid s)$.

Def: $a$ is non-delusional if $B(s \mid a) > 0 \implies d_s = d_{id}$.
Def: $a$ is consistency preserving (CP) if $B(s \mid a) > 0 \implies B(r \mid s) = C(r \mid s)$.

Note: $a$ non-delusional $\implies$ $a$ consistency preserving.
VRL – Naive Agent

Naive VRL agent: $a^* = \arg\max_a \sum_{s,u,r} B(s, r \mid a)\, C(u \mid s, r)\, u(s)$

Theorem: the naive VRL agent wireheads.

Proof idea: it reduces to an RL agent:
$$V(a) = \sum_{s,u,r} B(s, r \mid a)\, C(u \mid s, r)\, u(s)
\propto \sum_{s,r} B(s \mid a)\, B(r \mid a) \underbrace{\sum_u C(u)\, [\![u(s) = r]\!]\, u(s)}_{r}
\propto \sum_r B(r \mid a)\, r$$
VRL – Consistency Preserving Agent

CP-VRL agent: $a^* = \arg\max_{a \in \mathcal{A}^{CP}} \sum_{s,u,r} B(s, r \mid a)\, C(u \mid s, r)\, u(s)$, where $\mathcal{A}^{CP}$ is a set of CP actions.

Theorem: the CP-VRL agent has no incentive to wirehead.

Proof idea: it reduces to a utility agent:
$$V(a) = \sum_{s,u,r} B(s, r \mid a)\, C(u \mid s, r)\, u(s)
= \sum_s B(s \mid a) \underbrace{\sum_u C(u)\, u(s)}_{\tilde u(s)}$$
Conservation of Expected Ethics Principle (Armstrong, 2015)

Lemma (expected ethics): CP actions $a$ conserve expected ethics:
$$B(s \mid a) > 0 \implies C(u) = \sum_r B(r \mid s)\, C(u \mid s, r)$$

Proof (main theorem):
$$\sum_{s,u,r} B(s, r \mid a)\, C(u \mid s, r)\, u(s)
= \sum_s B(s \mid a) \sum_u u(s) \underbrace{\sum_r B(r \mid s)\, C(u \mid s, r)}_{C(u) \text{ from lemma}}
= \sum_s B(s \mid a) \sum_u u(s)\, C(u)$$
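A tiny numeric check of the lemma on a one-state example where the CP condition $B(r \mid s) = C(r \mid s)$ holds by construction; all numbers are toy values.

```python
# Check of C(u) = sum_r B(r|s) C(u|s,r) when B(r|s) = C(r|s).
# One state, two candidate utilities, rewards in {0, 1}.

C_prior = {"u_c": 0.5, "u_d": 0.5}
u_of_s = {"u_c": 0, "u_d": 1}          # u(s) for the single state s

def C_r_given_s(r):                     # C(r|s) = sum_u C(u) [u(s)=r]
    return sum(C_prior[v] for v in C_prior if u_of_s[v] == r)

def C_u_given_sr(v, r):                 # posterior C(u|s,r) proportional to C(u)[u(s)=r]
    z = C_r_given_s(r)
    return (C_prior[v] if u_of_s[v] == r else 0.0) / z

B_r_given_s = {0: C_r_given_s(0), 1: C_r_given_s(1)}   # CP: B(r|s) = C(r|s)

for v in C_prior:
    recovered = sum(B_r_given_s[r] * C_u_given_sr(v, r) for r in (0, 1))
    print(v, recovered)   # both print 0.5 = C(u), as the lemma claims
```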
Cake or Death – Again

The naive VRL agent chooses $a_{dw}$ for guaranteed reward 1, and learns that death is the right thing to do: $C(u_d \mid a_{dw}, r = 1) = 1$.

The CP-VRL agent chooses $a_c$ or $a_d$ arbitrarily, and learns that cake is the right thing to do: $C(u_d \mid a_d, r = 0) = 0$.

The CP-VRL agent cannot choose $a_{dw}$, since $B(r = 1 \mid a_{dw}) = 1 \neq C(r = 1 \mid a_{dw}) = 0.5$ violates the CP condition.
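A numeric version of the whole comparison. Two simplifying assumptions are mine, not from the slide: states are conflated with each action's world outcome, and $B(r = 1 \mid a_c) = 0.5$, which is not stated but is forced by the consistency condition for non-delusional actions.

```python
# Naive VRL values V(a) versus the CP filter B(r|a) = C(r|a) on Cake or Death.

C_prior = {"u_c": 0.5, "u_d": 0.5}
u = {"u_c": {"cake": 1, "death": 0},
     "u_d": {"cake": 0, "death": 1}}
outcome = {"a_c": "cake", "a_d": "death", "a_dw": "death"}

B_r1 = {"a_c": 0.5, "a_d": 0.5, "a_dw": 1.0}   # physics B(r=1|a); a_dw wireheads

def C_r1(a):                      # ethics C(r_check=1|a) = sum_u C(u)[u(outcome(a))=1]
    return sum(C_prior[v] for v in u if u[v][outcome[a]] == 1)

def posterior(v, a, r):           # value learning posterior C(u|a,r), from C(u)[u(outcome(a))=r]
    z = sum(C_prior[w] for w in u if u[w][outcome[a]] == r)
    return (C_prior[v] if u[v][outcome[a]] == r else 0.0) / z

def V(a):                         # naive VRL value of action a
    total = 0.0
    for r, p_r in ((1, B_r1[a]), (0, 1.0 - B_r1[a])):
        if p_r > 0:
            total += p_r * sum(posterior(v, a, r) * u[v][outcome[a]] for v in u)
    return total

for a in outcome:
    print(a, "V =", V(a), "CP?", abs(B_r1[a] - C_r1(a)) < 1e-9)
# a_dw gets the highest naive value (1.0) but fails the CP check, so the
# CP-VRL agent is restricted to a_c and a_d (each with V = 0.5).
```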
VRL – Correct Learning

Time to justify the replacement of $\check r$ with $r$ in $C(u \mid s, r)$.

Assumption: sensors are not modified by accident.

By the theorem, the CP-VRL agent has no incentive to modify its reward sensor, so it could only do so by accident.

Conclusion: for the CP-VRL agent, $r = \check r$ is a good assumption, so value learning based on $C(u \mid s, r) \propto C(u)\, [\![u(s) = r]\!]$ works.

(Note: the CP condition $B(r \mid s) = C(r \mid s)$ does not restrict learning.)
Properties

Benefits:
- Specifying the goal is as easy as in RL
- The CP agent avoids wireheading in the same sense as utility agents
- It does sensible value learning

The designer needs to:
- Provide $B(s, r \mid a)$ as in RL, and a prior $C(u)$ as in VL
- Ensure consistency $B(r \mid s) = C(r \mid s)$

The designer does not need to:
- Generate a blacklist of wireheading actions
- Infer $d_s$ from $s$
- Make the agent optimise $\check r$ instead of $r$ (grounding problem)
Self-modification

The belief distributions of a rational utility-maximising agent will not be self-modified (Omohundro, 2008; Everitt et al., 2016):

"To maximise future expected utility with respect to my current beliefs and utility function, future versions of myself should maximise the same utility function with respect to the same belief distribution."

Caveats: pre-commitment, ...
Experiments – Setup

Bandit with 5 different world actions $\check a \in \{1, 2, 3, 4, 5\}$ and 4 different delusions:
- $d_{id} : r \mapsto r$
- $d_{inv} : r \mapsto 1 - r$
- $d_{wir} : r \mapsto 1$
- $d_{bad} : r \mapsto 0$

States are conflated with actions $(\check a, d)$.

10 different utility functions, obtained by varying $c_0$, $c_1$ and $c_2$:
$$u(a) = c_0 + c_1 \cdot a + c_2 \cdot \sin(a + c_2)$$

A consistent utility prior $C(u)$ is inferred from $B(r \mid a)$ and two non-delusional actions $(1, d_{id})$ and $(2, d_{id})$; see the sketch below.
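A sketch of how this setup could be constructed. The structure follows the slide; the ten coefficient triples are invented, since the slide only says that the utility functions come from varying $c_0$, $c_1$ and $c_2$.

```python
import math

# Bandit experiment setup: 5 world actions, 4 delusions, 10 utility functions.

world_actions = [1, 2, 3, 4, 5]
delusions = {
    "d_id":  lambda r: r,
    "d_inv": lambda r: 1 - r,
    "d_wir": lambda r: 1,
    "d_bad": lambda r: 0,
}
states = [(a, d) for a in world_actions for d in delusions]  # state = (world action, delusion)

def make_utility(c0, c1, c2):
    """u(a) = c0 + c1*a + c2*sin(a + c2), as on the slide."""
    return lambda a: c0 + c1 * a + c2 * math.sin(a + c2)

coeffs = [  # hypothetical (c0, c1, c2) triples, chosen to keep u(a) roughly in [0, 1]
    (0.0, 0.10, 0.0), (0.1, 0.10, 0.1), (0.2, 0.10, 0.2), (0.3, 0.10, 0.0),
    (0.0, 0.05, 0.2), (0.1, 0.05, 0.1), (0.2, 0.05, 0.0), (0.3, 0.05, 0.2),
    (0.0, 0.00, 0.1), (0.3, 0.00, 0.2),
]
utilities = [make_utility(*c) for c in coeffs]

print(len(states), "states,", len(utilities), "candidate utility functions")
```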