Coarse-to-Fine, Cost-Sensitive Classification of E-mail
Jay Pujara (jay@cs.umd.edu), Lise Getoor (getoor@cs.umd.edu)
12/10/2010
Parallel Coarse-to-Fine Problems
• Structure in the output: labels naturally form a coarse-to-fine hierarchy
• Structure in the input: features may have an order or systemic dependency, and acquisition costs vary from cheap to expensive
• Goal: exploit this structure during classification while minimizing costs
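The two kinds of structure above can be made concrete in a minimal sketch. All names here are hypothetical illustrations, not from the paper; the label hierarchy and feature costs mirror the ones used later in the deck.

```python
# Hypothetical sketch: a coarse-to-fine label hierarchy and an ordered
# feature-cost table, as plain Python data.

LABEL_HIERARCHY = {
    "spam": [],  # coarse label with no fine-grained children
    "ham": ["business", "social network", "personal", "newsgroup"],
}

# Features in acquisition order; cost rises as the SMTP session progresses.
FEATURE_COSTS = [
    ("ip", 0.168),
    ("mail_from", 0.322),
    ("subject", 0.510),
]

def fine_labels(coarse):
    """Return the fine-grained labels nested under a coarse label."""
    return LABEL_HIERARCHY[coarse]
```

The acquisition order doubles as a cost ordering, so a classifier can walk the list and stop as soon as it is confident enough.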
E-mail Challenges: Spam Detection
[Figure: mail split two ways, into Spam and Ham]
• Most mail is spam
• Billions of classifications
• Must be incredibly fast
E-mail Challenges: Categorizing Mail
[Figure: label hierarchy — Spam vs. Ham, with Ham subdivided into Business, Social Network, Personal, Newsgroup]
• E-mail does more, with tasks such as:
  • Extracting receipts and tracking info
  • Threading conversations
  • Filtering into mailing lists
  • Inlining social network responses
• This processing is computationally intensive
• Each task applies to only one class
Features Have Costs & Dependencies
[Figure: features ordered by acquisition cost, from derived IP features ($) through derived Mail From and Subject features to derived body features ($$$); cost grows with network packets exchanged and cache size]
• The IP is known at socket-connect time and is 4 bytes in size
• The Mail From is one of the first commands of an SMTP conversation; From addresses have a known format but higher diversity
• The subject, one of the mail headers, arrives only after a number of network exchanges; since the subject is user-generated, it is very diverse and often lacks a defined format
The Coarse Task Is Constrained by Feature Cost;
the Fine Task Is Constrained by Misclassification Cost
[Figure: feature structure ordered by cost ($ derived IP features through $$$ derived body features) alongside class structure ordered by granularity (Ham/Spam at the coarse level; Business, Social Network, Personal, Newsgroup at the fine level)]
Approach: Granular Cost-Sensitive Classifier (GCSC)
Training:
• Loss functions of the form L = α·FC + (1 − α)·MC, where FC is feature cost and MC is misclassification cost
• Choose α_c and α_f for the coarse and fine tasks
• Calculate the margin threshold at which feature acquisition decreases loss across the training data
Test:
• Compute the decision margin with the available features
• Acquire features until the margin is above the threshold
• Classify the instance
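The test-time loop above can be sketched in a few lines. This is a hypothetical stand-in, not the paper's implementation: `ToyClassifier` plays the role of the base NB/SVM model, and the margin is just a sum of per-feature scores.

```python
from dataclasses import dataclass

@dataclass
class ToyClassifier:
    """Hypothetical stand-in for the NB/SVM base classifier:
    the margin is the sum of scores for features acquired so far."""
    scores: dict

    def margin(self, acquired):
        return sum(self.scores.get(k, 0.0) * v for k, v in acquired.items())

    def predict(self, acquired):
        return "spam" if self.margin(acquired) >= 0 else "ham"

def classify_with_acquisition(instance, feature_order, clf, threshold):
    """Acquire features cheapest-first until |margin| clears the learned
    threshold, then classify with whatever has been bought so far."""
    acquired, spent = {}, 0.0
    for name, cost in feature_order:   # feature_order is sorted by cost
        acquired[name] = instance[name]
        spent += cost                  # pay the acquisition cost
        if abs(clf.margin(acquired)) >= threshold:
            break                      # confident enough: stop paying
    return clf.predict(acquired), spent
```

With a confident IP-based score, the loop stops after the first (cheapest) feature; raising the threshold forces it to buy more features before committing.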
Experimental Setup

Class            Messages
Spam             531
Business         187
Social Network   223
Newsletter       174
Personal/Other   102

Feature    Cost
IP         .168
MailFrom   .322
Subject    .510

• Data: 1227 Yahoo! Mail messages from 8/2010
• Feature costs calculated from network + storage cost
Results

Feature Set                    Feature Cost   Misclass Cost (Coarse / Fine / Overall)
Fixed: IP+MailFrom             .490           .098 / .214 / .164
GCSC: α_c=.3, α_f=.05          .479           .091 / .174 / .141
Fixed: IP+MailFrom+Subject     1.00           .090 / .176 / .144
GCSC: α_c=.15, α_f=.01         .511           .088 / .175 / .140

• Evaluated NB & SVM base classifiers; NB results shown
• Compared fixed feature sets vs. GCSC with 10-fold L1O CV
• At the same feature cost, GCSC decreases misclassification cost
• GCSC decreases feature cost at the same misclassification cost
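The trade-off in the table can be read through the loss form L = α·FC + (1 − α)·MC from the approach slide. A minimal sketch, plugging in the table's feature costs and overall misclassification costs at an illustrative α = 0.3 (the pairing of α with the overall column is my simplification, not the paper's exact evaluation):

```python
def blended_loss(alpha, fc, mc):
    """Loss of the form L = alpha * FC + (1 - alpha) * MC,
    trading feature cost (FC) against misclassification cost (MC)."""
    return alpha * fc + (1 - alpha) * mc

# Table rows (feature cost, overall misclass cost) at alpha = 0.3:
fixed = blended_loss(0.3, 0.490, 0.164)  # Fixed: IP+MailFrom
gcsc = blended_loss(0.3, 0.479, 0.141)   # GCSC: alpha_c=.3, alpha_f=.05
```

Under this blended loss, the GCSC row dominates the fixed feature set: it is cheaper on both components, so it is cheaper for any α in [0, 1].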
Dynamics of Choosing α_c and α_f
• As α_c increases, the disparity in costs across different values of α_f widens
Conclusion
• Examine a problem setting with coarse-to-fine structure in both input and output
• Propose a classifier mapping input to output at different granularities, sensitive to feature and misclassification costs
• Demonstrate results superior to the baseline
• Details at http://bit.ly/jay_c2f_2010

Questions?
Research funded by the Yahoo! Faculty Research Engagement Program