  1. Safe Machine Learning Silvia Chiappa & Jan Leike · ICML 2019

  2. ML Research Reality: offline datasets, annotated a long time ago; simulated environments; abstract domains; restart experiments at will ... [Annotated 3D-model illustration with the labels "horns", "nose", "tail", "also more cute". Image credit: Keenan Crane & Nepluno, CC BY-SA]

  3. Deploying ML in the real world has real-world consequences @janleike

  5. Why safety?
     short-term faults: fairness, biased datasets, safe exploration, adversarial robustness, privacy, interpretability, ...
     short-term misuse: fake news, deep fakes, spamming, ...
     long-term faults: alignment, shutdown problems, reward hacking, ...
     long-term misuse: automated hacking, terrorism, totalitarianism, ... @janleike

  7. Why safety?
     short-term faults: biased datasets, privacy, safe exploration, adversarial robustness, ...
     short-term misuse: fake news, deep fakes, spamming, ...
     spanning short- and long-term: fairness, alignment, adversarial testing, interpretability
     long-term faults: shutdown problems, reward hacking, ...
     long-term misuse: automated hacking, terrorism, totalitarianism, ... @janleike

  8. The space of safety problems (Ortega et al., 2018): Specification: behave according to intentions. Robustness: withstand perturbations. Assurance: analyze & monitor activity. @janleike

  9. Safety in a nutshell @janleike

  10. Safety in a nutshell Where does this come from? (Specification) @janleike

  11. Safety in a nutshell Where does this come from? (Specification) What about rare cases/adversaries? (Robustness) @janleike

  12. Safety in a nutshell: Where does this come from? (Specification) How good is our approximation? (Assurance) What about rare cases/adversaries? (Robustness) @janleike

  13. Outline Intro Specification for RL Assurance – break – Specification: Fairness @janleike

  14. Specification Does the system behave as intended? @janleike

  15. Degenerate solutions and misspecifications The surprising creativity of digital evolution (Lehman et al., 2017) https://youtu.be/TaXUZfwACVE @janleike

  16. Degenerate solutions and misspecifications. The surprising creativity of digital evolution (Lehman et al., 2017) https://youtu.be/TaXUZfwACVE · Faulty reward functions in the wild (Amodei & Clark, 2016) https://openai.com/blog/faulty-reward-functions/ · More examples: tinyurl.com/specification-gaming (H/T Victoria Krakovna) @janleike

  18. What if we train agents with a human in the loop? @janleike

  19. Algorithms for training agents from human data
     demos + myopic: behavioral cloning
     demos + nonmyopic: IRL, GAIL
     feedback + myopic: TAMER, COACH
     feedback + nonmyopic: RL from modeled rewards @janleike

  21. Potential performance [chart comparing the potential performance of imitation, TAMER/COACH, and RL from modeled rewards against human performance] @janleike

  22. Specifying behavior [images: AlphaGo's move 37 against Lee Sedol; the circling boat] @janleike

  24. Reward modeling @janleike

  26. Learning rewards from preferences: the Bradley-Terry model. Akrour et al. (ECML PKDD 2011), Christiano et al. (NeurIPS 2017) @janleike
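
For concreteness, a minimal sketch (not the papers' code) of the Bradley-Terry preference loss used in reward modeling: the probability that one trajectory segment is preferred over another is a logistic function of the difference in summed predicted rewards, trained with cross-entropy against the human label. The reward model and segment shapes below are illustrative assumptions.

```python
# Illustrative sketch of a Bradley-Terry preference loss for reward modeling
# (a hypothetical reward_model torch module is assumed; shapes are made up).
import torch

def preference_loss(reward_model, segment_1, segment_2, human_prefers_1):
    """Cross-entropy loss under the Bradley-Terry model.

    segment_1, segment_2: tensors of shape (T, obs_dim), two trajectory segments.
    human_prefers_1: 1.0 if the human preferred segment_1, 0.0 otherwise (0.5 = indifferent).
    """
    r1 = reward_model(segment_1).sum()   # summed predicted reward of segment 1
    r2 = reward_model(segment_2).sum()   # summed predicted reward of segment 2
    # P[segment_1 preferred] = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)
    logit = r1 - r2
    target = torch.as_tensor(human_prefers_1, dtype=torch.float32)
    return torch.nn.functional.binary_cross_entropy_with_logits(logit, target)

# Toy usage with a linear reward model over 4-dimensional observations:
reward_model = torch.nn.Linear(4, 1)
seg_a, seg_b = torch.randn(25, 4), torch.randn(25, 4)
loss = preference_loss(reward_model, seg_a, seg_b, human_prefers_1=1.0)
loss.backward()  # gradients flow into the reward model parameters
```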

  27. Reward modeling on Atari: reaching superhuman performance, outperforming "vanilla" RL. [Learning curves shown relative to the best human score.] Christiano et al. (NeurIPS 2017) @janleike

  28. Imitation learning + reward modeling: demos → imitation policy, preferences → reward model, RL fine-tunes the policy against the reward model. Ibarz et al. (NeurIPS 2018) @janleike

  29. Scaling up: What about domains too complex for human feedback? Safety via debate (Irving et al., 2018) · Iterated amplification (Christiano et al., 2018) · Recursive reward modeling (Leike et al., 2018) @janleike

  30. Reward model exploitation (Ibarz et al., NeurIPS 2018): 1. Freeze successfully trained reward model. 2. Train new agent on it. 3. Agent finds loophole. Solution: train the reward model online, together with the agent. @janleike
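
As an illustration only, a toy loop for the online solution: the reward model is refreshed with human comparisons of the agent's latest behaviour before each policy update, so the agent never optimises a stale, exploitable reward model for long. Every callable below is a hypothetical placeholder, not the authors' implementation.

```python
# Toy sketch of training the reward model online, together with the agent.
def train_jointly(agent, reward_model, collect_episode, ask_human_preference,
                  update_reward_model, update_agent, iterations=100):
    for _ in range(iterations):
        # 1. The current agent produces fresh behaviour.
        episode_a = collect_episode(agent)
        episode_b = collect_episode(agent)
        # 2. A human compares the two episodes; the label refreshes the reward model
        #    exactly where the agent currently explores, closing emerging loopholes.
        label = ask_human_preference(episode_a, episode_b)
        reward_model = update_reward_model(reward_model, episode_a, episode_b, label)
        # 3. The agent improves against the *updated* reward model.
        agent = update_agent(agent, reward_model, episode_a)
    return agent, reward_model
```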

  31. A selection of other specification work @janleike

  32. Avoiding unsafe states by blocking actions: 4.5h of human oversight, 0 unsafe actions in Space Invaders. Saunders et al. (AAMAS 2018) @janleike
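
A hedged sketch of the blocking idea: a learned blocker sits between the agent and the environment and vetoes actions it predicts to be unsafe, substituting a fallback action and optionally penalising the attempt. The gym-style environment interface, the blocker, and the fallback action are assumptions, not the paper's implementation.

```python
# Sketch of an action-blocking environment wrapper in the spirit of Saunders et al. (2018).
class BlockedActionEnv:
    def __init__(self, env, blocker, safe_fallback_action, penalty=-1.0):
        self.env = env                        # environment with reset()/step() (gym-style, assumed)
        self.blocker = blocker                # callable: (observation, action) -> True if unsafe
        self.fallback = safe_fallback_action  # action substituted for a blocked one
        self.penalty = penalty                # small penalty so the agent learns not to propose blocked actions
        self.last_obs = None

    def reset(self):
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        blocked = self.blocker(self.last_obs, action)
        obs, reward, done, info = self.env.step(self.fallback if blocked else action)
        if blocked:
            reward += self.penalty
        self.last_obs = obs
        return obs, reward, done, info
```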

  33. Shutdown problems: if the return is > 0, the agent wants to prolong the episode (disable the off-switch); if it is < 0, the agent wants to shorten the episode (press the off-switch).
     Safe interruptibility: Q-learning is safely interruptible, but not SARSA. Solution: treat interruptions as off-policy data. Orseau and Armstrong (UAI 2016).
     The off-switch game: solution is to retain uncertainty over the reward function ⇒ the agent doesn't know the sign of the return. Hadfield-Menell et al. (IJCAI 2017). @janleike

  34. Understanding agent incentives: Causal influence diagrams, Everitt et al. (2019). Impact measures: estimate the difference between states, e.g. # of steps between states, # of reachable states, difference in value; Krakovna et al. (2018). @janleike
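
As a toy illustration of an impact measure (a simplified reachability-based variant, not the papers' exact formulation): penalise the agent when fewer states are reachable from its current state than from a baseline state. The transition graph and the beta coefficient below are assumptions for the example.

```python
# Illustrative sketch of a reachability-style impact penalty.
from collections import deque

def reachable_states(graph, start):
    """Breadth-first search over a dict {state: [successor states]}."""
    seen, frontier = {start}, deque([start])
    while frontier:
        for nxt in graph.get(frontier.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def impact_penalty(graph, current_state, baseline_state, beta=1.0):
    """Penalty proportional to how many fewer states are reachable than from the baseline."""
    lost = len(reachable_states(graph, baseline_state)) - len(reachable_states(graph, current_state))
    return beta * max(0, lost)

# Toy example: an irreversible action ("breaking the vase") cuts off two states.
graph = {'start': ['vase intact'], 'vase intact': ['goal'], 'vase broken': []}
print(impact_penalty(graph, current_state='vase broken', baseline_state='start'))  # 2.0
```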

  35. Assurance Analyzing, monitoring, and controlling systems during operation. @janleike

  36. White-box analysis: saliency maps; finding the channel that most supports a decision; maximizing activation of neurons/layers. Olah et al. (Distill, 2017, 2018) @janleike
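
A minimal sketch of one tool in this family, a plain gradient saliency map: the gradient of the chosen class score with respect to the input pixels highlights which pixels most support the decision. The tiny CNN and input shape are illustrative; this is not Olah et al.'s feature-visualisation code.

```python
# Gradient-saliency sketch on a toy, untrained CNN.
import torch

def saliency_map(model, image, target_class):
    """image: tensor of shape (1, C, H, W); returns per-pixel saliency of shape (H, W)."""
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]  # scalar score of the class of interest
    score.backward()                       # d(score) / d(input pixels)
    return image.grad.abs().max(dim=1)[0].squeeze(0)  # max over colour channels

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(8 * 32 * 32, 10),
)
sal = saliency_map(model, torch.rand(1, 3, 32, 32), target_class=3)
print(sal.shape)  # torch.Size([32, 32])
```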

  37. Black-box analysis: finding rare failures. Approximate an "AVF" f: initial MDP state ⟼ P[failure]. Train on a family of related agents of varying robustness ⇒ bootstrapping by learning the structure of difficult inputs on weaker agents. Result: failures found ~1,000x faster. Uesato et al. (2018) @janleike
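
A hedged, toy illustration of the bootstrapping idea: a cheap learned proxy for P[failure] ranks candidate initial states, so the expensive evaluations of the final agent are concentrated on likely failures instead of uniform sampling. The data, features, and logistic-regression proxy are stand-ins, not the method's actual models.

```python
# Toy sketch of prioritised failure search with a learned failure-probability proxy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 1. Rollouts from a family of weaker agents: initial-state features and failure labels.
states = rng.normal(size=(5000, 8))
failed = (states[:, 0] + 0.5 * rng.normal(size=5000) > 1.5).astype(int)  # failures are rare

# 2. Fit the failure-probability proxy ("AVF").
avf = LogisticRegression(max_iter=1000).fit(states, failed)

# 3. Rank a large pool of candidate initial states; evaluate the strong agent on the top ones.
candidates = rng.normal(size=(100_000, 8))
p_fail = avf.predict_proba(candidates)[:, 1]
to_evaluate = candidates[np.argsort(-p_fail)[:100]]  # send these to the real evaluation
print(to_evaluate.shape)  # (100, 8)
```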

  38. Verification of neural networks.
     Reluplex: ε-local robustness at point x0; rewrite this as a SAT formula with linear terms; use an SMT-solver to solve the formula; Reluplex: special algorithm for branching with ReLUs; verified adversarial robustness of a 6-layer MLP with ~13k parameters. Katz et al. (CAV 2017), Ehlers (ATVA 2017).
     Interval bound propagation: ImageNet downscaled to 64x64. Gowal et al. (2018). @janleike
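
For intuition, a minimal sketch of interval bound propagation on a toy ReLU MLP: propagate elementwise lower/upper bounds through each layer and check that the true class logit provably exceeds every other logit for all inputs in the eps-ball. The random weights below are placeholders, not a verified network.

```python
# Minimal interval bound propagation (IBP) sketch.
import numpy as np

def ibp_bounds(layers, x, eps):
    """layers: list of (W, b); returns (lower, upper) bounds on the output logits."""
    lower, upper = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_lower = W_pos @ lower + W_neg @ upper + b
        new_upper = W_pos @ upper + W_neg @ lower + b
        if i < len(layers) - 1:  # ReLU on hidden layers; monotone, so bounds pass through it
            new_lower, new_upper = np.maximum(new_lower, 0), np.maximum(new_upper, 0)
        lower, upper = new_lower, new_upper
    return lower, upper

def verified_robust(layers, x, eps, true_class):
    lower, upper = ibp_bounds(layers, x, eps)
    # Sound (conservative) check: worst-case true logit still beats every other logit's best case.
    return bool(lower[true_class] > np.delete(upper, true_class).max())

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(16, 4)), np.zeros(16)), (rng.normal(size=(3, 16)), np.zeros(3))]
print(verified_robust(layers, x=rng.normal(size=4), eps=0.01, true_class=0))
```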

  39. Questions?

  40. — 10 min break —

  41. Part II Specification: Fairness Silvia Chiappa · ICML 2019

  42. ML systems are used in areas that severely affect people's lives: financial lending, hiring, online advertising, criminal risk assessment, child welfare, health care, surveillance.

  43. Two examples of problematic systems 1. Criminal Risk Assessment Tools Defendants are assigned scores that predict the risk of re-committing crimes. These scores inform decisions about bail, sentencing, and parole. Current systems have been accused of being biased against black people. 2. Face Recognition Systems Considered for surveillance and self-driving cars. Current systems have been reported to perform poorly, especially on minorities.

  44. From public optimism to concern The Economist Attitudes to police technology are changing—not only among American civilians but among the cops themselves. Until recently Americans seemed willing to let police deploy new technologies in the name of public safety. But technological scepticism is growing. On May 14th San Francisco became the first American city to ban its agencies from using facial recognition systems.

  45. One fairness definition or one framework?
     "Nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator."
     "There are a number of definitions and research groups are not on the same page when it comes to the definition of fairness."
     "The search for one true definition is not a fruitful direction, as technical considerations cannot adjudicate moral debates."
     21 Fairness Definitions and Their Politics. Arvind Narayanan. ACM Conference on Fairness, Accountability, and Transparency Tutorial (2018).
     S. Mitchell, E. Potash, and S. Barocas (2018); P. Gajane and M. Pechenizkiy (2018); S. Verma and J. Rubin (2018).
     Differences/connections between fairness definitions are difficult to grasp. We lack a common language/framework.

  46. Common group-fairness definitions (binary classification setting). Dataset: sensitive attribute, class label, prediction of the class, features. Demographic Parity: the percentage of individuals assigned to class 1 should be the same for groups A=0 and A=1. [Illustration: males vs. females]
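
To make the definition concrete, a small illustrative check (not from the slides): compute the positive-prediction rate in each group and compare. The array names and data are placeholders.

```python
# Minimal sketch of checking demographic parity from data.
import numpy as np

def demographic_parity_gap(y_pred, a):
    """y_pred: 0/1 predictions; a: 0/1 sensitive attribute.
    Returns |P(Y_hat=1 | A=0) - P(Y_hat=1 | A=1)|; 0 means demographic parity holds."""
    return abs(y_pred[a == 0].mean() - y_pred[a == 1].mean())

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
a      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_gap(y_pred, a))  # 0.75 vs 0.25 -> gap of 0.5
```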

  47. Common group-fairness definitions: Equal False Positive/Negative Rates (EFPRs/EFNRs); Predictive Parity.
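
As an illustration only (not from the slides), the per-group quantities behind these two criteria can be computed directly: EFPRs/EFNRs compare error rates across groups, while predictive parity compares precision, P(Y=1 | Y_hat=1). The data below are placeholders.

```python
# Sketch of per-group false positive/negative rates and precision.
import numpy as np

def group_rates(y_true, y_pred, a, group):
    m = (a == group)
    yt, yp = y_true[m], y_pred[m]
    fpr = ((yp == 1) & (yt == 0)).sum() / max((yt == 0).sum(), 1)
    fnr = ((yp == 0) & (yt == 1)).sum() / max((yt == 1).sum(), 1)
    precision = ((yp == 1) & (yt == 1)).sum() / max((yp == 1).sum(), 1)
    return fpr, fnr, precision

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 1])
a      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for g in (0, 1):
    fpr, fnr, prec = group_rates(y_true, y_pred, a, g)
    print(f"group {g}: FPR={fpr:.2f}  FNR={fnr:.2f}  precision={prec:.2f}")
# EFPRs/EFNRs require the FPR/FNR values to match across groups;
# predictive parity requires the precision values to match.
```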

  48. The Law. Regulated domains: lending, education, hiring, housing (extends to targeted advertising). Protected (sensitive) groups: reflect the fact that in the past there have been unjust practices.
