USING CLASSIFIER CASCADES FOR SCALABLE E-MAIL CLASSIFICATION
Jay Pujara (jay@cs.umd.edu), Hal Daumé III (me@hal3.name), Lise Getoor (getoor@cs.umd.edu)
2/23/2012
Building a scalable e-mail system
• Goal: Maintain system throughput across conditions
• Varying conditions
  - Load varies
  - Resource availability varies
  - Task varies
• Challenge: Build a system that can adapt its operation to the conditions at hand
Problem structure informs scalable solution
[Diagram: feature structure ordered by acquisition cost, from cheap ($) to expensive ($$$): derived IP, Mail From, Subject, and Body features; class structure ordered by granularity, from coarse (Spam vs. Ham) to fine (Social Network, Business, Personal, Newsgroup).]
Important facets of the problem
• Structure in input
  - Features may have an order or systemic dependency
  - Acquisition costs vary: cheap or expensive features
• Structure in output
  - Labels naturally have a coarse-to-fine hierarchy
  - Different levels of the hierarchy have different sensitivities to cost
• Exploit structure during classification
• Minimize costs, minimize error
Two overarching questions
• When should we acquire features to classify a message?
• How does this acquisition policy change across different classification tasks?
• Classifier cascades can answer both questions!
Introducing Classifier Cascades
• A series of classifiers f1, f2, f3, ..., fn
• Each classifier operates on a different, increasingly expensive set of features φ1, φ2, φ3, ..., with costs c1, c2, c3, ..., cn
• Each classifier outputs a value in [-1, 1]: the margin, or confidence, of its decision
• The γ parameters control the relationship between classifiers: when |fi| < γi, the decision is not confident enough and the next, more expensive classifier is invoked
[Diagram: f1(φ1) at cost c1 → f2(φ1, φ2) at cost c2 → f3(φ1, φ2, φ3) at cost c3 → ...]
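The following is a minimal Python sketch of how such a cascade could evaluate a message. It is not the authors' implementation; the names (cascade_predict, stages, gammas) and the example stage classifiers are illustrative assumptions.

```python
def cascade_predict(stages, gammas, x):
    """Evaluate one message x with a classifier cascade (illustrative sketch).

    stages: list of (classify, cost) pairs; classify(x) returns a margin in [-1, 1].
    gammas: thresholds gamma_1 ... gamma_{n-1}; |margin| < gamma_i means "not
            confident, pay for the next, more expensive feature set".
    Returns (predicted label in {-1, +1}, total feature-acquisition cost).
    """
    total_cost = 0.0
    margin = 0.0
    for i, (classify, cost) in enumerate(stages):
        total_cost += cost                      # pay to acquire this stage's features
        margin = classify(x)                    # margin / confidence of the decision
        last_stage = (i == len(stages) - 1)
        if last_stage or abs(margin) >= gammas[i]:
            break                               # confident enough, or no stages left
    return (1 if margin >= 0 else -1), total_cost


# Toy usage with made-up stage classifiers and costs (IP, MailFrom, Subject):
stages = [(lambda x: 0.2, 0.168), (lambda x: -0.9, 0.322), (lambda x: 0.5, 0.510)]
print(cascade_predict(stages, [0.6, 0.4], "a message"))  # stops after the confident 2nd stage
```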
Optimizing Classifier Cascades
• Loss function L(y, F(x)) counts errors in classification
• Minimize the loss function, incorporating cost
  - Cost constraint with budget (load-sensitive): min Σ_(x,y)∈D L(y, F(x)) s.t. C(x) < B
  - Cost-sensitive loss function (granular): min Σ_(x,y)∈D L(y, F(x)) + λ C(x)
• Use grid search to find optimal γ parameters
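A brief sketch of that grid search, reusing the cascade_predict helper above. Folding both objectives (the budget constraint and the λ-weighted cost) into one routine, and the particular grid of threshold values, are my assumptions rather than the paper's exact procedure.

```python
import itertools

def grid_search_gammas(stages, data, budget=None, lam=None,
                       grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick cascade thresholds by exhaustive grid search (illustrative sketch).

    data:   list of (x, y) pairs with y in {-1, +1}.
    budget: if given, enforce the load-sensitive constraint  mean cost < budget.
    lam:    if given, use the granular objective  loss + lam * cost.
    """
    best, best_obj = None, float("inf")
    n_thresh = len(stages) - 1
    for gammas in itertools.product(grid, repeat=n_thresh):
        errors, costs = 0.0, 0.0
        for x, y in data:
            y_hat, c = cascade_predict(stages, list(gammas), x)
            errors += (y_hat != y)
            costs += c
        loss, cost = errors / len(data), costs / len(data)
        if budget is not None and cost >= budget:
            continue                              # violates the cost constraint
        obj = loss + (lam * cost if lam is not None else 0.0)
        if obj < best_obj:
            best, best_obj = gammas, obj
    return best
```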
Load-Sensitive Classification
Features have costs & dependencies
• IP: known at socket-connect time and only 4 bytes in size, so it is the cheapest feature to acquire
• Mail From: sent in one of the first commands of an SMTP conversation; from-addresses follow a known format but show higher diversity
• Subject: one of the mail headers, available only after a number of network exchanges; because it is user-generated, it is very diverse and often lacks a defined format
[Diagram: derived IP, Mail From, Subject, and Body features arranged from cheap ($) to expensive ($$$) by network packets required and cache size.]
Load-Sensitive Problem Setting
[Diagram: cascade of IP, MailFrom, and Subject classifiers; a message advances to the next stage when |f1| < γ1 and then |f2| < γ2.]
• Train IP, MailFrom, and Subject classifiers
• For a given budget B, choose γ1, γ2 that minimize error within B
• Constraint: C(x) < B
Load-Sensitive Challenges
• Choosing γ1, γ2 on training data risks overfitting the model
• Costs measured at train time underestimate those seen at test time
• Use a regularization constant Δ
  - Sensitive to the cost variance (σ)
  - Accounts for variability
• Revised constraint: C(x) + Δσ < B
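One way to realize the revised constraint is to compare the budget against the mean per-message cost plus Δ times its standard deviation. The helper below is a sketch under that assumption; the default Δ is simply one of the values reported in the experiments, not a recommendation.

```python
from statistics import mean, pstdev

def within_regularized_budget(per_message_costs, budget, delta=0.25):
    """Check the revised constraint: mean cost + delta * std-dev < budget (sketch).

    Penalizing the standard deviation of observed costs hedges against threshold
    settings whose train-time cost estimates look cheaper than test-time behaviour.
    """
    c_bar = mean(per_message_costs)
    sigma = pstdev(per_message_costs)          # population std-dev of observed costs
    return c_bar + delta * sigma < budget
```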
Granular Classification
E-mail Challenges: Spam Detection
• Most mail is spam
• Billions of classifications
• Must be incredibly fast
E-mail Challenges: Categorizing Mail
• E-mail systems do more than spam filtering, with tasks such as:
  - Extracting receipts and tracking info
  - Threading conversations
  - Filtering into mailing lists
  - Inlining social network responses
• This processing is computationally intensive
• Each task applies to only one class
Coarse task is constrained by feature cost
[Diagram: the tradeoff λc attaches to the feature structure; the coarse Spam vs. Ham decision must keep feature acquisition cheap.]

Fine task is constrained by misclassification cost
[Diagram: the tradeoff λf attaches to the class structure; errors on the fine-grained categories (Social Network, Business, Personal, Newsgroup) are what matter.]
Granular Classification Problem Setting
[Diagram: two cascades over the same IP → MailFrom → Subject stages; the coarse Spam/Ham cascade is tuned with L(y, h(x)) + λc C(x), the fine Social Network / Business / Personal / Newsgroup cascade with L(y, h(x)) + λf C(x).]
• Two separate models for the different tasks, with different classifiers and cascade parameters
• Choose γ1, γ2 for each cascade to balance accuracy and cost, with different tradeoffs λ
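Putting the pieces together for the granular setting, the sketch below tunes the same cascade stages twice with different λ values, reusing the helpers from the earlier sketches. Treating the fine-grained task as another binary cascade and routing only ham through it are simplifications for illustration, not the paper's exact pipeline; the default λ values are illustrative.

```python
def build_granular_classifier(stages, coarse_data, fine_data,
                              lam_coarse=1.5, lam_fine=0.075):
    """Tune one cascade per task, each with its own cost tradeoff (sketch).

    lam_coarse / lam_fine play the roles of lambda_c and lambda_f.
    """
    coarse_gammas = grid_search_gammas(stages, coarse_data, lam=lam_coarse)
    fine_gammas = grid_search_gammas(stages, fine_data, lam=lam_fine)

    def classify(x):
        # Coarse pass: cheap spam-vs-ham decision applied to the whole mail stream.
        label, _ = cascade_predict(stages, list(coarse_gammas), x)
        if label == -1:
            # Fine pass on (assumed) ham only; in the paper this is a multi-class
            # categorization task, so the binary interface here is purely illustrative.
            label, _ = cascade_predict(stages, list(fine_gammas), x)
        return label

    return classify
```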
Experimental Results
Experimental Setup: Overview
• Two tasks: load-sensitive & granular classification
• Two datasets: Yahoo! Mail corpus and TREC-2007
  - Load-sensitive uses both datasets; granular uses only Yahoo!
• Results are L1O, 10-fold CV, with bold values significant (p < .05)
• Cascade stages use the MEGAM MaxEnt classifier
Experimental Setup: Yahoo! Data

Class            Messages
Spam             531
Business         187
Social Network   223
Newsletter       174
Personal/Other   102

Feature    Cost
IP         .168
MailFrom   .322
Subject    .510

• Data from 1227 Yahoo! Mail messages from 8/2010
• Feature costs calculated from network + storage cost
Experimental Setup: TREC Data

Class   Messages
Spam    39055
Ham     8139

• Data from the TREC-2007 Public Spam Corpus, 47194 messages
• Uses the same feature cost estimates
Results: Load-Sensitive Classification
Regularization prevents cost excesses

Average excess cost:
Δ      Y!Mail   TREC
0      .115     .059
.25    .020     0.00
Results: Load-Sensitive Classification
Significant error reduction
[Chart: classification error L(x) for Naive vs. ACC with Δ = 0, .25, .5 on the Yahoo! Mail and TREC-2007 datasets; the adaptive cascades achieve lower error than the naive baseline.]
Results: Granular Classification

Feature Set                   Feature Cost   Misclass. Cost (Coarse / Fine / Overall)
Fixed: IP                     .168           .139 / .181 / .229
ACC: λc=1.5, λf=1             .187           .140 / .156 / .217
Fixed: IP+MailFrom            .490           .128 / .142 / .200
ACC: λc=.1, λf=.075           .431           .111 / .100 / .163
Fixed: IP+MailFrom+Subject    1.00           .106 / .108 / .162
ACC: λc=.02, λf=.02           .691           .108 / .105 / .162

• Compare fixed feature acquisition policies to adaptive classifiers (ACC)
• Significant gains in performance or cost (or both), depending on the tradeoff
Dynamics of choosing λc and λf
Different approaches, same tradeoff
Conclusion
• Problem of scalable e-mail classification
• Introduce two settings
  - Load-sensitive classification: known budget
  - Granular classification: task sensitivity
• Use classifier cascades to achieve a tradeoff between cost and accuracy
• Demonstrate results superior to baseline

Questions?
Research funded by the Yahoo! Faculty Research Engagement Program