
FairTest: Discovering unwarranted associations in data-driven applications
IEEE EuroS&P, April 28th, 2017
Florian Tramèr


  1. FairTest: Discovering unwarranted associations in data-driven applications
     IEEE EuroS&P, April 28th, 2017
     Florian Tramèr 1, Vaggelis Atlidakis 2, Roxana Geambasu 2, Daniel Hsu 2, Jean-Pierre Hubaux 3, Mathias Humbert 4, Ari Juels 5, Huang Lin 3
     1 Stanford University, 2 Columbia University, 3 École Polytechnique Fédérale de Lausanne, 4 Saarland University, 5 Cornell Tech

  2. “Unfair” associations + consequences

  3. “Unfair” associations + consequences
     These are software bugs: we need to actively test for them and fix them (i.e., debug) in data-driven applications, just as with functionality, performance, and reliability bugs.

  4. Unwarranted Associations Model
     Diagram: user inputs and protected inputs feed into a data-driven application, which produces application outputs.

  5. Limits of preventative measures
     What doesn’t work:
     • Hide protected attributes from the data-driven application.
     • Aim for statistical parity w.r.t. protected classes and service output.
     The foremost challenge is to even detect these unwarranted associations.

  6. A Framework for Unwarranted Associations
     1. Specify relevant data features:
     • Protected variables (e.g., Gender, Race, …)
     • “Utility”: a function of the algorithm’s output (e.g., Price, Error rate, …)
     • Explanatory variables (e.g., Qualifications)
     • Contextual variables (e.g., Location, Job, …)
     2. Find statistically significant associations between protected attributes and utility
     • Condition on explanatory variables
     • Not tied to any particular statistical metric (e.g., odds ratio)
     3. Granular search in semantically meaningful subpopulations
     • Efficiently list subgroups with highest adverse effects
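
As a minimal illustration of step 2 with one concrete metric choice, here is a sketch using the odds ratio and Fisher's exact test on a 2x2 contingency table; the counts and group labels below are hypothetical, and FairTest itself is not tied to this particular metric.

    from scipy.stats import fisher_exact

    # Hypothetical 2x2 table: rows = protected classes (group A / group B),
    # columns = utility (high-price offers / low-price offers).
    table = [[120, 880],
             [ 80, 920]]

    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
    # A small p-value flags a candidate unwarranted association; step 3 then
    # searches for the subpopulations in which the effect is strongest.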

  7. FairTest: a testing suite for data-driven apps
     • Finds context-specific associations between protected variables and application outputs, conditioned on explanatory variables
     • Bug report ranks findings by association strength and affected population size
     Diagram: user inputs (location, click, …) and application outputs (prices, tags, …), together with protected vars. (race, gender, …), context vars. (zip code, job, …), and explanatory vars. (qualifications, …), are fed to FairTest, which produces an association bug report for the developer.

  8. A data-driven approach
     The core of FairTest is based on statistical machine learning:
     • Data, ideally sampled from the relevant user population, is split into training and test data.
     • Training data: find context-specific associations.
     • Test data: statistically validate associations.
     Statistical machine learning internals:
     • top-down spatial partitioning algorithm
     • confidence intervals for assoc. metrics
     • …
     Example report of associations of O=Price on S_i=Income (assoc. metric: normalized mutual information, NMI):

     Global Population of size 494,436
     p-value=3.34e-10 ; NMI=[0.0001, 0.0005]
     Price   Income <$50K    Income >=$50K    Total
     High    15301 (6%)      13867 (6%)       29168 (6%)
     Low     234167 (94%)    231101 (94%)     465268 (94%)
     Total   249468 (50%)    244968 (50%)     494436 (100%)

     1. Subpopulation of size 23,532
     Context = { State: CA, Race: White }
     p-value=2.31e-24 ; NMI=[0.0051, 0.0203]
     Price   Income <$50K    Income >=$50K    Total
     High    606 (8%)        691 (4%)         1297 (6%)
     Low     7116 (92%)      15119 (96%)      22235 (94%)
     Total   7722 (33%)      15810 (67%)      23532 (100%)

     2. Subpopulation of size 2,198
     Context = { State: NY, Race: Black, Gender: Male }
     p-value=7.72e-05 ; NMI=[0.0040, 0.0975]
     Price   Income <$50K    Income >=$50K    Total
     High    52 (4%)         8 (1%)           60 (3%)
     Low     1201 (96%)      937 (99%)        2138 (97%)
     Total   1253 (57%)      945 (43%)        2198 (100%)

     ...more entries (sorted by decreasing NMI)...
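
A rough sketch of how an NMI value and a p-value like those in the report can be computed from a contingency table. Normalizing by the smaller marginal entropy is one common convention; FairTest's exact procedure (held-out test data, confidence intervals for the metric) differs, so the numbers below will not match the report exactly.

    import numpy as np
    from scipy.stats import chi2_contingency

    def normalized_mutual_info(table):
        """NMI between the row and column variables of a contingency table."""
        p_xy = table / table.sum()                 # joint distribution
        p_x = p_xy.sum(axis=1, keepdims=True)      # row marginals
        p_y = p_xy.sum(axis=0, keepdims=True)      # column marginals
        nz = p_xy > 0
        mi = (p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum()
        h_x = -(p_x[p_x > 0] * np.log(p_x[p_x > 0])).sum()
        h_y = -(p_y[p_y > 0] * np.log(p_y[p_y > 0])).sum()
        return mi / min(h_x, h_y)   # one common normalization choice

    # Global-population table from the report: rows = Price (High/Low),
    # columns = Income (<$50K, >=$50K).
    table = np.array([[15301.0, 13867.0],
                      [234167.0, 231101.0]])

    _, p_value, _, _ = chi2_contingency(table)
    print("NMI =", normalized_mutual_info(table), " p-value =", p_value)
    # The NMI point estimate lands in the report's global interval; the
    # p-value is only illustrative of how such a test can be run.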

  9. Reports for fairness bugs
     • Example: simulation of a location-based pricing scheme (see the report on the previous slide)
     • Test for disparate impact on low-income populations
     • Low effect over the whole US population
     • High effects in specific sub-populations (e.g., { State: CA, Race: White } and { State: NY, Race: Black, Gender: Male })

  10. Association-Guided Decision Trees
     Goal: find the most strongly affected user sub-populations.
     Split into sub-populations with increasingly strong associations between protected variables and application outputs.
     Diagram: a decision tree that splits first on Occupation (A, B, C, …) and then on Age (< 50, ≥ 50), and so on.
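
A simplified sketch of the association-guided splitting idea, assuming a pandas DataFrame with categorical context columns and mutual information as the (swappable) association metric; the actual FairTest algorithm differs in its split scoring and statistically validates each reported subpopulation on held-out data.

    import numpy as np
    import pandas as pd
    from sklearn.metrics import mutual_info_score

    def association(df, protected, output):
        # Association metric for a (sub)population; mutual information here,
        # but any metric from the next slide's table could be plugged in.
        return mutual_info_score(df[protected], df[output])

    def grow_tree(df, protected, output, context_vars, depth=3, min_size=500):
        """Greedy top-down partitioning: at each node, pick the categorical
        context feature whose split yields the child with the strongest
        association, then recurse into that child."""
        if depth == 0 or not context_vars:
            return []
        best = None
        for feat in context_vars:
            for value, child in df.groupby(feat):
                if len(child) < min_size:
                    continue
                score = association(child, protected, output)
                if best is None or score > best[0]:
                    best = (score, feat, value, child)
        if best is None:
            return []
        score, feat, value, child = best
        context = {feat: value}
        remaining = [f for f in context_vars if f != feat]
        return [(context, score, len(child))] + [
            ({**context, **c}, s, n)
            for c, s, n in grow_tree(child, protected, output, remaining,
                                     depth - 1, min_size)
        ]

    # Toy usage with hypothetical columns (protected='income', output='price').
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "income": rng.choice(["<50K", ">=50K"], 20000),
        "price": rng.choice(["high", "low"], 20000, p=[0.06, 0.94]),
        "state": rng.choice(["CA", "NY", "TX"], 20000),
        "race": rng.choice(["white", "black", "asian"], 20000),
    })
    for context, score, size in grow_tree(df, "income", "price", ["state", "race"]):
        print(context, round(score, 6), size)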

  11. Association-Guided Decision Trees
     • Efficient discovery of contexts with high associations
     • Outperforms previous approaches based on frequent itemset mining
     • Easily interpretable contexts by default
     • Association-metric agnostic:
       Metric                    Use Case
       Binary ratio/difference   Binary variables
       Mutual Information        Categorical variables
       Pearson Correlation       Scalar variables
       Regression                High-dimensional outputs
       Plug in your own!         ???
     • Greedy strategy (some bugs could be missed)
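
One way to picture the metric-agnostic design is a small dispatch over variable types; the helper below is hypothetical and covers only three of the cases in the table above.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import mutual_info_score

    def association_metric(protected, utility, kind):
        """Pick a metric per the table above. `kind` is a hypothetical hint:
        'binary' (0/1 utility), 'scalar' (both numeric), or 'categorical'."""
        protected, utility = np.asarray(protected), np.asarray(utility)
        if kind == "binary":
            # Ratio of positive-outcome rates between the two protected groups.
            groups = np.unique(protected)
            rate0 = utility[protected == groups[0]].mean()
            rate1 = utility[protected == groups[1]].mean()
            return rate1 / rate0
        if kind == "scalar":
            return pearsonr(protected, utility)[0]    # Pearson correlation
        return mutual_info_score(protected, utility)  # mutual information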

  12. Example: healthcare application
     Predictor of whether a patient will visit the hospital again in the next year (from the winner of the 2012 Heritage Health Prize competition).
     Diagram: the hospital re-admission predictor takes age, gender, # emergencies, … as inputs and predicts whether the patient will be re-admitted.
     FairTest findings: strong association between age and prediction error rate.
     The association may translate into quantifiable harms (e.g., if the model is used to adjust insurance premiums).
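
For a scalar protected attribute such as age, the association with prediction error can be sketched with a Pearson correlation (per the metric table above); the data below is synthetic and purely illustrative, not the Heritage Health data.

    import numpy as np
    from scipy.stats import pearsonr

    # Synthetic stand-ins: patients' ages and the model's absolute prediction
    # error for each patient.
    rng = np.random.default_rng(0)
    age = rng.integers(18, 90, size=5000)
    abs_error = rng.gamma(2.0, 1.0, size=5000) + 0.02 * age  # toy data only

    r, p_value = pearsonr(age, abs_error)
    print(f"correlation between age and error: r={r:.3f}, p={p_value:.2e}")
    # A significant correlation on held-out test data is the kind of finding
    # FairTest would report as a potential association bug.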

  13. Debugging with FairTest
     Are there confounding factors? Do associations disappear after conditioning? ⇒ Adaptive Data Analysis!
     Example: the healthcare application (again)
     • Estimate the prediction confidence (target variance), e.g., cases with high confidence in the prediction
     • Does this explain the predictor’s behavior? Yes, partially
     FairTest helps developers understand & evaluate potential association bugs.
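
The conditioning step can be sketched as recomputing the association within strata of the explanatory variable (here, bins of prediction confidence); the column names and data below are hypothetical.

    import numpy as np
    import pandas as pd
    from sklearn.metrics import mutual_info_score

    # Hypothetical data: protected attribute, binary error indicator, and the
    # explanatory variable we want to condition on (prediction confidence).
    rng = np.random.default_rng(2)
    df = pd.DataFrame({
        "age_group": rng.choice(["young", "old"], size=10000),
        "error": rng.choice([0, 1], size=10000),
        "confidence": rng.uniform(0, 1, size=10000),
    })

    # Unconditioned association over the whole population.
    print("global MI:", mutual_info_score(df["age_group"], df["error"]))

    # Condition on the explanatory variable: bin confidence and recompute the
    # association within each bin. If it largely disappears, confidence is a
    # plausible (partial) explanation for the observed association.
    df["conf_bin"] = pd.qcut(df["confidence"], q=4)
    for conf_bin, stratum in df.groupby("conf_bin", observed=True):
        mi = mutual_info_score(stratum["age_group"], stratum["error"])
        print(f"confidence {conf_bin}: MI = {mi:.5f}")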

  14. Other applications studied using FairTest
     • Image tagger based on ImageNet data
       ⇒ Large output space (~1000 labels)
       ⇒ FairTest automatically switches to regression metrics
       ⇒ Tagger has a higher error rate for pictures of black people
     • Simple movie recommender system
       ⇒ Men are assigned movies with lower ratings than women
       ⇒ Use personal preferences as an explanatory factor
       ⇒ FairTest finds no significant bias anymore

  15. Closing remarks
     The Unwarranted Associations Framework
     • Captures a broader set of algorithmic biases than prior work
     • Principled approach for statistically valid investigations
     FairTest
     • The first end-to-end system for evaluating algorithmic fairness
     Developers need better statistical training and tools to make better statistical decisions and applications.
     http://arxiv.org/abs/1510.02377

  16. Example: Berkeley graduate admissions
     Admission into UC Berkeley graduate programs (Bickel, Hammel, and O’Connell, 1975).
     Diagram: graduate admissions committees take age, gender, GPA, … as inputs and decide whether to admit the applicant.
     Bickel et al.’s (and also FairTest’s) findings: gender bias in admissions at the university level, but mostly gone after conditioning on department.
     FairTest helps developers understand & evaluate potential association bugs.
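
A minimal sketch of the same check: compare the gender/admission association university-wide with the association within each department. The `applications` DataFrame and its columns are hypothetical placeholders; the actual admissions figures are in Bickel et al.

    import pandas as pd
    from scipy.stats import chi2_contingency

    def gender_association(records):
        """Chi-squared p-value for the gender x admitted contingency table."""
        table = pd.crosstab(records["gender"], records["admitted"])
        return chi2_contingency(table)[1]

    # `applications` is a hypothetical DataFrame with columns
    # 'gender', 'department', and 'admitted' (0/1), e.g. loaded from a CSV.
    def analyze(applications):
        # Apparent bias over the whole university (unconditioned).
        print("university-wide p-value:", gender_association(applications))
        # Condition on the explanatory variable: test within each department.
        for dept, records in applications.groupby("department"):
            print(f"department {dept}: p-value = {gender_association(records):.3f}")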
