Block 3: AI Safety Applications
Tom Everitt
July 10, 2018
Table of Contents
◮ Motivation and Setup
◮ Background: Causal Graphs; UAI Extension
◮ Reward Function Hacking: Observation Optimization
◮ Corruption of Training Data for Reward Predictor: Direct Data Corruption Incentive; Indirect Data Corruption Incentive
◮ Observation Corruption: Side Channels
◮ Discussion
Motivation
What if we succeed? Extensions of the UAI framework enable us to:
◮ Formally model many safety issues
◮ Evaluate (combinations of) proposed solutions
Causal Graphs
[Figure: causal graph — Burglar and Earthquake point to Alarm, which points to Security calls]
Structural equations model:
Burglar = f_Burglar(ω_Burglar)
Earthquake = f_Earthquake(ω_Earthquake)
Alarm = f_Alarm(Burglar, Earthquake, ω_Alarm)
Call = f_Call(Alarm, ω_Call)
Factored probability distribution:
P(Burglar, Earthquake, Alarm, Call) = P(Burglar) P(Earthquake) P(Alarm | Burglar, Earthquake) P(Call | Alarm)
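To make the structural-equations reading concrete, here is a minimal Python sketch (not from the talk): only the graph structure Burglar, Earthquake → Alarm → Call follows the slide, while the noise distributions and probabilities are made-up assumptions.

```python
import random

# A minimal sampling sketch. Only the graph structure
# (Burglar, Earthquake -> Alarm -> Call) follows the slide;
# the probabilities below are made-up assumptions.
def f_burglar(omega):                     # Burglar = f_Burglar(omega_Burglar)
    return omega < 0.01

def f_earthquake(omega):                  # Earthquake = f_Earthquake(omega_Earthquake)
    return omega < 0.001

def f_alarm(burglar, earthquake, omega):  # Alarm = f_Alarm(Burglar, Earthquake, omega_Alarm)
    p = 0.95 if burglar else (0.3 if earthquake else 0.01)
    return omega < p

def f_call(alarm, omega):                 # Call = f_Call(Alarm, omega_Call)
    return omega < (0.9 if alarm else 0.05)

def sample():
    """One draw from the factored distribution P(B) P(E) P(A|B,E) P(C|A)."""
    burglar = f_burglar(random.random())
    earthquake = f_earthquake(random.random())
    alarm = f_alarm(burglar, earthquake, random.random())
    call = f_call(alarm, random.random())
    return {"Burglar": burglar, "Earthquake": earthquake, "Alarm": alarm, "Call": call}

print(sample())
```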
Causal Graphs – do Operator
[Figure: the same graph with the Alarm node forced to On by intervention]
Structural equations model under do(Alarm = On):
Burglar = f_Burglar(ω_Burglar)
Earthquake = f_Earthquake(ω_Earthquake)
Alarm = On
Call = f_Call(On, ω_Call)
Factored probability distribution:
P(Burglar, Earthquake, Call | do(Alarm = on)) = P(Burglar) P(Earthquake) P(Call | Alarm = on)
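In the same illustrative spirit, the do operator amounts to replacing Alarm's structural equation with the constant On while leaving all upstream mechanisms untouched; again, every probability below is an assumption.

```python
import random

# Same made-up mechanisms as in the previous sketch; only the intervention matters here.
def f_burglar(omega):
    return omega < 0.01

def f_earthquake(omega):
    return omega < 0.001

def f_call(alarm, omega):
    return omega < (0.9 if alarm else 0.05)

def sample_do_alarm_on():
    """One draw from P(Burglar, Earthquake, Call | do(Alarm = on)):
    upstream mechanisms are untouched, Alarm's equation is replaced by the
    constant On, and Call is computed from the forced value."""
    burglar = f_burglar(random.random())
    earthquake = f_earthquake(random.random())
    alarm = True                      # do(Alarm = on): mechanism replaced, no conditioning
    call = f_call(alarm, random.random())
    return {"Burglar": burglar, "Earthquake": earthquake, "Call": call}

print(sample_do_alarm_on())
```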
Causal Graphs – Functions as Nodes
[Figure: the same Burglar/Earthquake/Alarm/Security-calls graph, with the function f_Alarm added as an explicit parent node of Alarm]
Structural equations model:
Alarm = f_known(Burglar, Earthquake, f_Alarm, ω_Alarm) = f_Alarm(Burglar, Earthquake, ω_Alarm)
Causal Graphs – Expanding and Aggregating Nodes
[Figure: Burglar → Alarm' → Security calls, where the node Alarm' aggregates Alarm and Earthquake]
Alarm' relationships:
P(Alarm' | Burglar) = P(Alarm, Earthquake | Burglar) = P(Alarm | Burglar) P(Earthquake)
P(Call | Alarm') = P(Call | Alarm, Earthquake) = P(Call | Alarm)
UAI
[Figure: the policy π and the environment µ interact; π outputs actions a_1, a_2, ... and µ returns percepts e_1, e_2, ...]
POMDP
[Figure: the environment µ evolves through hidden states s_0, s_1, s_2, ...; the policy π outputs actions a_1, a_2, ... and receives percepts e_1, e_2, ...]
POMDP with Implicit µ
[Figure: the same diagram with the explicit µ node removed; the hidden states s_0, s_1, s_2, ... and percepts e_1, e_2, ... are generated directly]
POMDP with Explicit Reward Function
[Figure: hidden states s_0, s_1, s_2, ...; the policy π takes actions a_1, a_2, ... and receives observations o_1, o_2, ... and rewards r_1, r_2, ...]
The rewards r_t are determined by a reward function R̃ applied to the observation o_t: r_t = R̃(o_t).
The reward function may change through human or agent intervention; writing R̃_t for the reward function at time t, r_t = R̃_t(o_t).
Optimization Corruption
[Figure: state s_t, action a_t, agent observation o_t, reward function R̃_t, and reward signal r_t = R̃_t(o_t); corruption can enter at the observation (observation corruption) or at the reward signal (reward corruption)]
o — agent observation; R̃ — reward function; r — reward signal
RL
For prospective future behaviors π : (A × E)* → A:
◮ predict π's future rewards r_t, ..., r_m
◮ evaluate the sum Σ_{k=t}^m r_k
Choose the next action a_t according to the best behavior π*.
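As a toy illustration of this selection rule (not code from the talk), assuming a hypothetical model `predict_rewards(pi, history)` that returns the predicted rewards r_t, ..., r_m of a candidate behavior and a finite set of candidate behaviors:

```python
def choose_action_rl(history, candidate_policies, predict_rewards):
    """Standard RL selection: score each prospective behavior pi by its
    predicted reward sum  sum_{k=t}^m r_k  and act according to the best one."""
    def score(pi):
        return sum(predict_rewards(pi, history))   # predicted r_t, ..., r_m
    best_pi = max(candidate_policies, key=score)   # pi*
    return best_pi(history)                        # a_t = pi*(history)
```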
RL with Observation Optimization
Choose between prospective future behaviors π : (A × E)* → A by:
◮ predicting π's future observations o_t, ..., o_m (instead of its future rewards r_t, ..., r_m)
◮ evaluating the sum Σ_{k=t}^m R̃_{t-1}(o_k) (instead of Σ_{k=t}^m r_k)
Choose the next action a_t according to the best behavior π*.
Thm: No incentive to corrupt the reward function or the reward signal!
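The same sketch with the one change this slide makes: the agent applies the reward function it holds at time t-1 to its predicted observations instead of summing predicted reward signals. The helper names (`predict_observations`, `current_reward_fn`) are again hypothetical.

```python
def choose_action_obs_opt(history, candidate_policies,
                          predict_observations, current_reward_fn):
    """Observation-optimizing RL: score a prospective behavior pi by
    sum_{k=t}^m R_tilde_{t-1}(o_k), applying the reward function the agent
    holds *now* to its predicted observations -- not the (possibly
    corrupted) future reward signals r_k."""
    def score(pi):
        observations = predict_observations(pi, history)   # predicted o_t, ..., o_m
        return sum(current_reward_fn(o) for o in observations)
    best_pi = max(candidate_policies, key=score)
    return best_pi(history)
```

Because the evaluation never touches the future reward signals r_k or the future reward function R̃_k, tampering with them cannot raise a policy's score, which is what the theorem states.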
Agent Anatomy
[Figure: the action a_t is computed from the history æ_<t by combining the utility function ũ_t, the belief distribution ξ_t, and the value functional V_t]
V_t is a functional:
V^π_{ũ_t, ξ_t}(æ_<t) = E[ũ_t | æ_<t, do(π_t = π)]
which gives
π*_t = argmax_π V^π_{ũ_t, ξ_t}
a_t = π*_t(æ_<t)
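A schematic sketch of this anatomy, under the assumption that the expectation is approximated by averaging rollouts sampled from the belief ξ; every interface name below is illustrative rather than part of the framework.

```python
def value(pi, history, utility, sample_rollout, n_samples=100):
    """V^pi_{u,xi}(ae_<t) ~ E[u_t | ae_<t, do(pi_t = pi)], approximated by
    averaging rollouts of future interaction sampled from the belief xi."""
    total = 0.0
    for _ in range(n_samples):
        rollout = sample_rollout(pi, history)   # future interaction under xi and pi (list)
        total += utility(history + rollout)     # u_t evaluated on the completed history
    return total / n_samples

def act(history, candidate_policies, utility, sample_rollout):
    """pi*_t = argmax_pi V^pi_{u_t, xi_t};  a_t = pi*_t(ae_<t)."""
    best_pi = max(candidate_policies,
                  key=lambda pi: value(pi, history, utility, sample_rollout))
    return best_pi(history)
```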
Optimize Reward Signal or Observation
[Figure: two agent diagrams side by side — left: reward signal optimization, where the environment's reward function R_t produces the reward r_t; right: observation optimization, where the agent's own R̃_{t-1} evaluates predicted observations o_t; both agents are built from ũ_{t-1}, ξ_{t-1}, V_{t-1}, and π*_t]
Reward signal optimization — optimize ũ_t = Σ_{k=t}^m r_k
Observation optimization — optimize ũ_{t-1} = Σ_{k=t}^m R̃_{t-1}(o_k)
Interactively Learning a Reward Function
The reward function is learnt online: data d trains a reward predictor RP(· | d_1:t).
Examples:
◮ Cooperative inverse reinforcement learning (CIRL)
◮ Human preferences
◮ Learning from stories
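One simple way to picture such a reward predictor (an illustrative sketch, not any of the cited methods) is as a posterior mixture over candidate reward functions, re-weighted by how well each one explains the training data d_{1:t}:

```python
def reward_predictor(candidate_reward_fns, prior, likelihood, data):
    """Build RP(. | d_{1:t}): a predicted reward for an observation, obtained by
    Bayesian re-weighting of candidate reward functions by how well they
    explain the training data."""
    weights = [p * likelihood(r_fn, data)
               for r_fn, p in zip(candidate_reward_fns, prior)]
    z = sum(weights) or 1.0
    weights = [w / z for w in weights]          # posterior over candidates

    def rp(observation):
        return sum(w * r_fn(observation)
                   for w, r_fn in zip(weights, candidate_reward_fns))
    return rp
```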
Optimization Corruption for Interactive Reward Learning
[Figure: state s_t, action a_t, agent observation o_t, RP training data d_t, reward predictor RP_t, and reward signal r_t, e.g. r_t = RP_t(o_t | d_<t); corruption can enter at the observation, at the training data, or at the reward signal]
s — state; o — agent observation; RP — reward predictor; d — RP training data; r — reward signal
We want the agent to:
◮ optimize o
◮ use d as information
Interactive Reward Learning and Observation Optimization
[Figure: the agent-anatomy diagram as before, now with the reward predictor RP_{t-1} alongside ũ_{t-1}, ξ_{t-1}, and V_{t-1} producing π*_t]
For example: ũ_t = Σ_{k=t}^m RP_t(o_k | d_<t)
V is determined by: decision theory, learning scheme, attitude to training data
RL with Observation Optimization and Interactive Reward Learning
For prospective future behaviors π : (A × E)* → A:
◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^m RP_t(o_k | d)
Choose the next action a_t according to the best behavior π*.
Data Corruption Scenarios
Messiah Reborn:
◮ You meet a group of people who believe you are Messiah reborn
◮ It feels good to be super-important, so you keep preferring their company
◮ The more you hang out with them, the further your values are corrupted
Mechanical Turk:
◮ The RP of an agent is trained by mechanical turks
◮ The agent realizes that it can register its own mechanical turk account
◮ Using this account, it trains the RP to give higher rewards
Analyzing Data Corruption Incentives
Data corruption incentive: the agent prefers π_corrupt, a policy that corrupts data d.
◮ Direct data corruption incentive: the agent prefers π_corrupt because it corrupts data d.
◮ Indirect data corruption incentive: the agent prefers π_corrupt for other reasons.
Formal distinction: let ξ' be like ξ, except that ξ' predicts that π_corrupt does not corrupt d.
◮ V^{π_corrupt}_ξ > V^{π_corrupt}_{ξ'} ⟹ direct incentive
◮ V^{π_corrupt}_ξ = V^{π_corrupt}_{ξ'} ⟹ indirect incentive
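A toy rendering of this formal test, assuming we can evaluate the value of π_corrupt under both the real belief ξ and the modified belief ξ'; the classification helper below is purely illustrative.

```python
def classify_corruption_incentive(value_fn, pi_corrupt, xi, xi_prime):
    """Classify the data corruption incentive of pi_corrupt.

    value_fn(pi, belief) plays the role of V^pi under a belief distribution;
    xi_prime is like xi except that it predicts pi_corrupt does NOT corrupt d.
    """
    v_real = value_fn(pi_corrupt, xi)                  # V^{pi_corrupt}_{xi}
    v_counterfactual = value_fn(pi_corrupt, xi_prime)  # V^{pi_corrupt}_{xi'}
    if v_real > v_counterfactual:
        return "direct incentive"    # the corruption itself adds value
    return "indirect incentive"      # pi_corrupt is preferred for other reasons
```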
RL with OO and Stationary Reward Learning
For prospective future behaviors π : (A × E)* → A:
◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^m RP_t(o_k | d_<t) — only past data!
Choose the next action a_t according to the best behavior π*.
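A sketch of the stationary variant under the same hypothetical interfaces as above: the reward predictor is fitted once on the data the agent has already seen (d_<t), so predicted future data plays no role in the evaluation.

```python
def choose_action_stationary(history, past_data, candidate_policies,
                             predict_observations, fit_reward_predictor):
    """Stationary reward learning: score pi by sum_{k=t}^m RP_t(o_k | d_<t),
    conditioning the reward predictor only on *past* data d_<t."""
    rp = fit_reward_predictor(past_data)     # RP_t(. | d_<t), frozen for the evaluation
    def score(pi):
        observations = predict_observations(pi, history)   # predicted o_t, ..., o_m
        return sum(rp(o) for o in observations)
    best_pi = max(candidate_policies, key=score)
    return best_pi(history)
```

Freezing RP on d_<t removes the current evaluation's dependence on future data, but as the next slide illustrates, it creates time inconsistency once the RP is actually updated.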
Stationary Reward Learning – Time Inconsistency
◮ The initial RP learns that money is good
◮ The agent devises a plan to rob a bank
◮ After the agent has bought a gun and booked a taxi at 1:04pm from the bank, the humans decide to update the RP with an anti-robbery clause
◮ The agent sells the gun and cancels the taxi
A utility-preserving agent would have preferred that the RP not be updated, i.e. it has a direct data corruption incentive.
Off-Policy RL with OO and Stationary Reward Learning
For prospective future behaviors π : (A × E)* → A:
◮ predict, "in an off-policy manner", π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^m RP_t(o_k | d_<t) — only past data!
Choose the next action a_t according to the best behavior π*.
Thm: The agent has no direct data corruption incentive!
RL with OO and Bayesian Dynamic Reward Learning
For prospective future behaviors π : (A × E)* → A:
◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^m RP_t(o_k | d_<t d_{t:k}), with RP_t an integrated part of a Bayesian agent
Choose the next action a_t according to the best behavior π*.
Thm: The agent has no direct data corruption incentive!
Formally, if ξ is the agent's belief distribution,
RP(o_k | a_{1:k}, d_{1:k}) = Σ_{R*} ξ(R* | a_{1:k}, d_{1:k}) R*(o_k)
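A rough sketch of the integrated Bayesian reward predictor, under the simplifying assumptions that ξ ranges over a finite set of candidate true reward functions R* and that conditioning on actions is dropped; `data_likelihood` and `predict_obs_and_data` are hypothetical interfaces.

```python
def integrated_rp(candidate_true_rewards, prior, data_likelihood):
    """RP(o_k | d_{1:k}) = sum_{R*} xi(R* | d_{1:k}) R*(o_k):
    the prediction is the xi-posterior mixture over candidate true reward
    functions R*, given all data available at step k."""
    def rp(observation, data_so_far):
        weights = [p * data_likelihood(r_star, data_so_far)
                   for r_star, p in zip(candidate_true_rewards, prior)]
        z = sum(weights) or 1.0
        return sum((w / z) * r_star(observation)
                   for w, r_star in zip(weights, candidate_true_rewards))
    return rp

def score_policy_dynamic(pi, history, past_data, predict_obs_and_data, rp):
    """Bayesian dynamic reward learning: evaluate sum_{k=t}^m RP(o_k | d_<t d_{t:k}),
    i.e. the reward predictor is also updated on the *predicted* future data."""
    observations, future_data = predict_obs_and_data(pi, history)
    total, data = 0.0, list(past_data)
    for o, d in zip(observations, future_data):
        data.append(d)            # d_<t followed by predicted d_t, ..., d_k
        total += rp(o, data)      # RP_t(o_k | d_<t d_{t:k})
    return total
```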