Jointly Modeling Relevance and Sensitivity for Search Among Sensitive Content
Mahmoud F. Sayed, Douglas W. Oard
10,045 FOIA requests; ~30k work-related emails
E-Discovery

Requesting Party ↔ Responding Party:
1. Formulation
2. Acquisition
3. Review for Relevance
4. Review for Privilege
5. Analysis

Review accounts for ~75% of total cost and ~1 month of elapsed time
Motivation

● Review is expensive
  ○ Hiring law firms
● Review is time-consuming
  ○ Long elapsed time between a request and its response
  ○ No effective access to information
● Objective: build "Search and Protection Engines"
  ○ Protect sensitive content (automatic sensitivity classification)
  ○ Still retrieve relevant content (learning to rank)
  ○ Affordable
  ○ Fast
Proposed Approaches

● Prefilter: the sensitivity classifier removes documents before ranking, so the ranker only sees the filtered collection
● Postfilter: the ranker retrieves over the full collection, then the sensitivity classifier filters the result list

Both pipelines are sketched below.
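A minimal sketch of the two pipelines, assuming placeholder `rank` and `is_sensitive` functions that stand in for the retrieval model and the sensitivity classifier (both names are illustrative, not from the talk):

```python
def prefilter_search(query, documents, rank, is_sensitive, k=10):
    """Prefilter: drop classifier-flagged documents, then rank what remains."""
    safe = [d for d in documents if not is_sensitive(d)]
    return rank(query, safe)[:k]

def postfilter_search(query, documents, rank, is_sensitive, k=10):
    """Postfilter: rank the full collection, then drop flagged documents."""
    return [d for d in rank(query, documents) if not is_sensitive(d)][:k]
```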
How to evaluate such approaches?
Discounted Cumulative Gain (DCG)

Gain values:
                Highly Relevant   Somewhat Relevant   Not Relevant
Retrieved             +3                 +1                0
Not Retrieved          0                  0                0

Example ranking: Highly Relevant, Somewhat Relevant, Not Relevant → DCG@5 = 5.7
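A minimal sketch of DCG with the slide's gain values; the standard log2 position discount is assumed (the slide does not show the discount explicitly), so the exact DCG@5 = 5.7 value depends on details of the example ranking not fully recoverable here:

```python
import math

# Gains from the slide's table: +3 highly relevant, +1 somewhat relevant,
# 0 not relevant; unretrieved documents contribute nothing.
GAIN = {"highly": 3, "somewhat": 1, "not": 0}

def dcg_at_k(labels, k):
    # Standard log2 discount (an assumption; not shown on the slide).
    return sum(GAIN[lab] / math.log2(rank + 1)
               for rank, lab in enumerate(labels[:k], start=1))
```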
Cost-Sensitive DCG (CS-DCG)

Relevance gains:
                Highly Relevant   Somewhat Relevant   Not Relevant
Retrieved             +3                 +1                0
Not Retrieved          0                  0                0

Sensitivity penalties:
                Sensitive   Not Sensitive
Retrieved          -10            0
Not Retrieved        0            0

Example ranking: Highly Relevant, Somewhat Relevant, Sensitive, Neither Relevant nor Sensitive
Relevance alone gives CS-DCG@5 = 5.7; retrieving the sensitive document drops it to CS-DCG@5 = -4.3
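A sketch of CS-DCG extending the DCG code above. Applying the -10 penalty without a position discount reproduces the slide's example exactly (5.7 - 10 = -4.3), though whether the published metric discounts the penalty by rank is not shown here:

```python
SENSITIVITY_PENALTY = -10  # from the slide's penalty table

def cs_dcg_at_k(docs, k):
    # docs: (relevance_label, is_sensitive) tuples in rank order.
    # The penalty is applied undiscounted, matching the slide's example
    # arithmetic; discounting it by rank is a plausible alternative.
    total = 0.0
    for rank, (lab, sensitive) in enumerate(docs[:k], start=1):
        total += GAIN[lab] / math.log2(rank + 1)
        if sensitive:
            total += SENSITIVITY_PENALTY
    return total
```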
Normalized CS-DCG (nCS-DCG)

Each query's CS-DCG is normalized between the worst and best achievable rankings for that query:
  Worst ranking: CS-DCG_worst = -19.8    Best ranking: CS-DCG_best = 5.95
  Example 1 (sensitive document retrieved): CS-DCG@5 = -4.3 → nCS-DCG@5 = 0.60
  Example 2: CS-DCG@5 = 5.7 → nCS-DCG@5 = 0.71
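The normalization appears to be min-max between the worst and best achievable CS-DCG for the query; that reading reproduces the slide's first example, (-4.3 - (-19.8)) / (5.95 - (-19.8)) ≈ 0.60. A sketch under that assumption:

```python
def ncs_dcg_at_k(docs, k, cs_dcg_best, cs_dcg_worst):
    # Min-max normalization (inferred from the slide's best/worst rankings).
    return (cs_dcg_at_k(docs, k) - cs_dcg_worst) / (cs_dcg_best - cs_dcg_worst)
```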
Experiments
LETOR OHSUMED Test Collection

● 348,566 medical publications
  ○ Fields: title, abstract, Medical Subject Headings (MeSH), etc.
  ○ 14,430 (with relevance judgments) for evaluation
  ○ 334,136 for sensitivity classifier training
● 106 queries (~150 relevance judgments per query)
  ○ 3 levels: (2) Highly Relevant, (1) Somewhat Relevant, (0) Not Relevant
● Simulating "sensitivity"
  ○ 2 MeSH labels (out of 118) represent sensitive content:
    ■ Male Urogenital Diseases [C12]
    ■ Female Urogenital Diseases and Pregnancy Complications [C13]
  ○ 12.2% of judged documents are sensitive
Sensitivity is Topic-Dependent

[Plot: per-topic sensitivity classification results, ranging from hard topics to easy topics]
nCS-DCG@10 Comparison
Proposed Approaches

● Prefilter: the sensitivity classifier filters the documents before the ranker sees them
● Joint: a single listwise LtR model, trained to optimize nCS-DCG, ranks documents using the query together with the sensitivity classifier's output
● Postfilter: the ranker retrieves over all documents, then the sensitivity classifier filters the result list
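The results table later in the talk lists Coordinate Ascent among the learners, so one plausible reading of "listwise LtR optimizing nCS-DCG" is greedy coordinate ascent over linear ranker weights with mean nCS-DCG@10 as the objective. A sketch under that assumption, where `evaluate_mean_ncs_dcg` is a placeholder that scores a weight vector over all training queries:

```python
import numpy as np

def coordinate_ascent(n_features, evaluate_mean_ncs_dcg,
                      steps=(0.5, -0.5, 0.1, -0.1), n_passes=5):
    # Greedily adjust one weight at a time, keeping any change that
    # improves mean nCS-DCG@10 over the training queries.
    w = np.ones(n_features)
    best = evaluate_mean_ncs_dcg(w)
    for _ in range(n_passes):
        for i in range(n_features):
            for step in steps:
                cand = w.copy()
                cand[i] += step
                score = evaluate_mean_ncs_dcg(cand)
                if score > best:
                    w, best = cand, score
    return w
```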
nCS-DCG@10 Comparison (with listwise LtR)
CS-DCG@10 Comparison

[Chart: fraction of queries with negative CS-DCG@10 under each approach: 20.7%, 44.3%, 27.3%, 25.4%]

Can we reduce the number of queries with negative CS-DCG scores?
Cluster-Based Replacement (CBR)

● Similar to diversity ranking (a sketch follows below)
  ○ Retrieved documents are clustered (20 clusters, using repeated bisection)
  ○ Any potentially sensitive document in the result list is replaced with a less sensitive document from the same cluster
● [Chart: queries with negative CS-DCG drop from 20.7% to 11%]
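A minimal sketch of CBR. Scikit-learn's KMeans stands in for the slide's repeated-bisection clustering (an assumption made here for self-containment), and `sensitivity` is the classifier's score for a document:

```python
from sklearn.cluster import KMeans

def cluster_based_replacement(ranked, vectors, sensitivity, k=10,
                              threshold=0.5, n_clusters=20):
    # ranked: doc ids in rank order; vectors: their feature vectors.
    # KMeans replaces the slide's repeated-bisection clustering here.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    cluster = dict(zip(ranked, labels))
    result, used = [], set()
    for doc in ranked[:k]:
        used.add(doc)
        if sensitivity(doc) < threshold:
            result.append(doc)
            continue
        # Least-sensitive unused document from the same cluster, if any.
        pool = [d for d in ranked if d not in used
                and cluster[d] == cluster[doc]
                and sensitivity(d) < sensitivity(doc)]
        if pool:
            swap = min(pool, key=sensitivity)
            used.add(swap)
            result.append(swap)
        else:
            result.append(doc)
    return result
```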
CBR Adversely Affects nCS-DCG No filter Prefilter Postfilter Joint unclustered clustered unclustered clustered unclustered clustered unclustered clustered BM25 0.727 0.779* 0.800 0.797 0.800 0.797 0.727 0.779* 0.761 0.764 0.811* 0.785 0.817* 0.785 0.727 0.790* Linear reg. 0.765 0.771 0.812* 0.788 0.823* 0.792 0.753 0.786* LambdaMart AdaRank 0.756 0.779 0.822* 0.792 0.817* 0.791 0.823* 0.799 Coor. Ascent 0.762 0.781 0.816* 0.791 0.818* 0.790 0.842* 0.805 * Indicates two-tailed t-test with p<0.05 19
Conclusion

● Proposed CS-DCG and nCS-DCG to balance relevance against sensitivity
● The joint modeling approach outperforms the straightforward prefilter and postfilter approaches
● Cluster-based replacement can reduce the number of queries with negative CS-DCG scores
Next Steps

● Train a sensitivity classifier with fewer examples
● Build test collections with real sensitivities
● Experiment with tri-state classification
  ○ Sensitive
  ○ Needs human review
  ○ Not Sensitive
Data and code: https://github.com/mfayoub/SASC

Thanks!
Mahmoud F. Sayed
mfayoub@cs.umd.edu