

  1. Catalog Classification at the Long Tail using IR and ML
     Neel Sundaresan (nsundaresan@ebay.com)
     Team: Badrul Sarwar, Khash Rohanimanesh, JD Ruvini, Karin Mauge, Dan Shen

  2. Then There was One… When asked if he understood that the laser pointer was broken, the buyer said “Of course, I’m a collector of broken laser pointers”

  3. Divine Reward! PetroliumJeliffe

  4. What do we sell on a daily basis?

  5. The Importance of Structured Information
     Search experience
     Recommender systems
     Fraud and counterfeit detection

  6. Discovering Catalogs: Challenges
     Our goal is to build catalogs using an unsupervised metadata extraction system.
     Challenges:
     Huge volume of raw text
     Highly unstructured
     High level of noise
     Lack of consistency/standardization in attribute name and value usage

  7. Take Advantage of the Community
     Savvy sellers provide plenty of useful information.
     We need to combine techniques that can:
     Extract attribute names and values from this large collection
     Remove noise and normalize attribute name and value usage

  8. We have the data

  9. We have the data

  10. The BIG Picture
      Example listing: BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition / Gorgeous Doll With Beautiful Blond Hair / In A Catalog Gown Of Purple And Silver / New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old / Free Shipping To US Only / Will Ship International / Please E-mail For Cost / Feel Free To Ask Me Any Questions Or Concerns / Smoke-Free Environment
      Extracted attributes: Year: 1999; Model: premiere night; Edition: home shopping special; Hair: blond; Gown: purple and silver; Condition: new / never removed from box / mint
      Pipeline: item data (title, attributes, seller info, item description) flows through the API stream (5-10M items, 20 GB daily) into an indexer running on a large-scale Hadoop cluster.
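      As a concrete illustration of the target output, the Barbie listing above distills to a structured record along these lines (a hypothetical sketch in Python; the field names are illustrative, not eBay's production schema):

```python
# Hypothetical structured record distilled from the raw Barbie listing above.
# Field names are illustrative only; they are not eBay's production schema.
item_record = {
    "title": 'BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition',
    "attributes": {
        "Year": "1999",
        "Model": "premiere night",
        "Edition": "home shopping special",
        "Hair": "blond",
        "Gown": "purple and silver",
        "Condition": "new / never removed from box / mint",
    },
    "seller_info": {
        "shipping": "Free Shipping To US Only / Will Ship International",
    },
}
```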

  11. Our Approach
      To build an automatic product catalog we follow these steps:
      Grouping items into categories (category classification)
      Weeding out noise through accessory classification
      Extraction of attribute names and values (simple two-pass approach)
      Cleaning and normalization
      Capturing human expertise through machine learning

  12. Catalog Discovery
      Improve value coverage for important names; use machine learning to expand value coverage.
      Product building: organize items into a hierarchical collection; match inventory to products; drive adoption.
      At each step we apply machine learning/text mining techniques.

  13. Item Categorization
      Near-similar titles:
      "Apple IPOD Nano 4GB Black NEW! Great Deal!"
      "Apple IPOD Nano 4GB Black NEW! Skin Great Deal!"
      Category classification: feature selection, smoothing, accessory classification (NBC)

  14. Class Pruning
      Class pruning is unique to eBay. We compute the posterior probability in NBC, i.e., P(C | title words). Some title words appear in a huge number of classes (for instance, "harry potter" appears in thousands of categories), which puts a strain on the online posterior probability computation. To fix this, we use class pruning: for a given feature we keep only a few top classes in the computation.
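      A minimal sketch of class pruning (illustrative Python under assumed data structures, not eBay's production code): offline, keep only each feature's top-k classes by likelihood; online, score only the classes that survive pruning for at least one title word.

```python
from collections import defaultdict

# Illustrative sketch of NBC class pruning; not eBay's production code.
# Assumed inputs: log_lik[word][cls] = log P(word | cls),
#                 log_prior[cls]    = log P(cls).

def build_pruned_index(log_lik, k=10):
    """Offline: for each feature, keep only its top-k classes by likelihood."""
    return {
        word: dict(sorted(per_class.items(), key=lambda x: x[1], reverse=True)[:k])
        for word, per_class in log_lik.items()
    }

def classify(title_words, pruned, log_prior):
    """Online: compute posteriors only over classes that survive pruning."""
    candidates = set()
    for w in title_words:
        candidates.update(pruned.get(w, {}))
    scores = {
        # A (word, class) pair that was pruned away contributes nothing;
        # that is exactly the approximation class pruning makes.
        c: log_prior[c] + sum(pruned[w].get(c, 0.0)
                              for w in title_words if w in pruned)
        for c in candidates
    }
    return max(scores, key=scores.get) if scores else None
```

      The payoff is that a title containing "harry potter" touches only a few dozen candidate classes per feature instead of the thousands of categories the phrase actually appears in.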

  15. Item Categorization on eBay
      A seller describes the item with a few keywords; eBay recommends 15 categories for the seller's consideration.

  16. Item Categorization on eBay
      Larry Bird Boston Celtics Signed Adidas Classic Jersey. Price: US $399.99, Buy It Now. Categorize into?
      Sports Mem, Cards & Fan Shop > Manufacturer Authenticated > Basketball-NBA
      Sports Mem, Cards & Fan Shop > Fan Apparel & Souvenirs > Basketball-NBA
      Sports Mem, Cards & Fan Shop > Autographs-Original > Basketball-NBA > Jerseys
      Clothing, Shoes & Accessories > Men's Clothing > Athletic Apparel
      Clothing, Shoes & Accessories > Men's Clothing > Shirts > T-Shirts, Tank Tops
      Collectibles > Advertising > Clothing, Shoes & Accessories > Clothing

  17. Challenge I
      Large collection of categories: 30K categories, many hard to distinguish, e.g.:
      1. Clothing, Shoes & Accessories > Costumes & Reenactment Attire > Costumes > Women
      2. Everything Else > Adult Only > Clothing, Shoes & Accessories > Costumes & Fantasy Wear > Women
      Insufficient information about items: item titles are limited to about 10 words, and title descriptions may be inaccurate or fraudulent.

  18. Challenge II
      Highly skewed item distribution: 6.9% of categories contain 17.3% of items; 1% of categories contain 51.7% of items.
      Scalability and efficiency: 4 million items daily with real-time response requires good scalability and high efficiency.

  19. Applications
      Recommending category candidates for a seller's listing
      Monitoring the misclassification rate on the current site
      Detecting outlier items

  20. Method
      Multinomial Bayesian algorithm
      Smoothing: addresses the data sparseness problem and the common/non-informative word problem
      Scaling up to cope with the highly skewed item distribution

  21. Bayesian Learning Framework
      We employ Naive Bayes with a multinomial likelihood function: find the most likely class c, the one with the maximum posterior probability of generating item t.
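      The slide's formula is not preserved in this transcript; the standard multinomial Naive Bayes decision rule it describes is:

```latex
% Reconstructed standard form; the slide's own notation is not preserved.
c^{*} \;=\; \arg\max_{c}\, P(c \mid t)
      \;=\; \arg\max_{c}\; P(c) \prod_{w \in t} P(w \mid c)^{n(w,t)}
```

      where n(w,t) is the number of times word w occurs in item t, P(c) is the class prior, and P(w|c) is the per-class multinomial word likelihood that the smoothing methods on the following slides estimate.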

  22. Approach
      Exploit the data to the maximum
      Apply simple algorithms at the same time

  23. Smoothing Algorithms
      Laplace smoothing
      Jelinek-Mercer smoothing
      Dirichlet prior
      Absolute discounting
      Shrinkage smoothing
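      For reference, textbook forms of the first four estimators (a sketch; the talk's exact parameterizations are not shown in this transcript). Each estimates P(w|c) from the count of w in category c, the category's total word count, and a background corpus model:

```python
# Textbook smoothing estimators for P(w | c); illustrative, since the talk's
# exact parameterizations are not preserved in this transcript.
# cnt: count of word w in category c; clen: total word count of category c;
# V: vocabulary size; p_bg: background corpus probability P(w).

def laplace(cnt, clen, V):
    # Add-one smoothing over the whole vocabulary.
    return (cnt + 1) / (clen + V)

def jelinek_mercer(cnt, clen, p_bg, lam=0.5):
    # Linear interpolation with the background model.
    return (1 - lam) * (cnt / clen) + lam * p_bg

def dirichlet_prior(cnt, clen, p_bg, mu=1000):
    # mu is the prior sample size varied in the experiments below.
    return (cnt + mu * p_bg) / (clen + mu)

def absolute_discounting(cnt, clen, p_bg, n_unique, delta=0.7):
    # n_unique: number of distinct words observed in category c.
    return max(cnt - delta, 0) / clen + (delta * n_unique / clen) * p_bg
```

      Shrinkage smoothing, the fifth method, instead interpolates a category's model with those of its ancestors in the category tree, so it requires the hierarchy and is omitted from this sketch.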

  24. Experiments
      Train: items sold on the eBay site over about one month; 18 million items, 18K categories.
      Test: items sold on the day following the training period; 278K items.

  25. Research Questions
      How do the various smoothing methods perform on our task?
      How does smoothing interact with:
      size of the training data set
      size of the category vocabulary
      focusedness of the category

  26. Overall Precision of Smoothing Methods
      [Chart: P@1, P@5, P@10, and P@15 precision (y-axis roughly 50-100) for No Smoothing, Laplace, Jelinek-Mercer, Dirichlet Priors, Absolute Discounting, and Shrinkage Smoothing; the annotated gains over no smoothing are in the 3.1-5.1% range.]

  27. Influence of Size of Training Set on Smoothing
      A larger data set leads to better performance, but the rate of increase is not fixed when the data set is blindly multiplied; the quality of the training data matters.
      Increasing the prior sample size μ improves performance.

  28. Influence of Category Size on Smoothing
      Smoothing addresses the insufficient-sample problem by eliminating the zero probability of unobserved words.
      Two data sets:
      LargeCat: categories containing >10K training instances
      SmallCat: categories containing <1K training instances

                                  LargeCat   SmallCat
      No Smoothing                  69.4       35.9
      Dirichlet    μ = 100          69.8       39.1
      Prior        μ = 500          69.4       45.1
                   μ = 1000         70.1       43.7
                   μ = 2000         71.0       41.0
                   μ = 5000         72.8       35.1

      LargeCat significantly outperforms SmallCat, by 27.7%. Smoothing saves the system: +3.4% on LargeCat, +9.2% on SmallCat.

  29. Influence of Word Specificity on Smoothing
      Smoothing handles common or non-informative words by decreasing their discrimination power.
      Two data sets:
      SpecCat: categories containing words with high IDF values
      NotSpecCat: categories containing words with low IDF values

                                  SpecCat   NotSpecCat
      No Smoothing                  71.4       42.8
      Dirichlet    μ = 100          73.0       43.7
      Prior        μ = 500          75.7       44.3
                   μ = 1000         75.1       44.7
                   μ = 2000         75.6       44.5
                   μ = 5000         76.4       45.0

      SpecCat significantly outperforms NotSpecCat, by 31.4%. Smoothing saves the system: +5.0% on SpecCat, +2.2% on NotSpecCat.

  30. Cataloging: Extraction of Attribute Names and Values
      We extract attribute names and values from millions of descriptions.
      Harder than named entity recognition: attribute names are not known beforehand.
      We employ a two-pass process.

  31. Pass 1: Name Identification
      Use a high-precision, low-recall extraction based on pattern search.
      Use seller counts, item counts, and other statistics to find names.
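      A minimal sketch of such a pass (illustrative Python; the actual patterns and thresholds are not given in the talk): match explicit "Name: value" constructions, then keep only names that recur across many distinct sellers and items.

```python
import re
from collections import defaultdict

# Illustrative pass-1 sketch: a high-precision "Name: value" pattern, then
# filtering by seller and item counts. Pattern and thresholds are made up.
PAIR = re.compile(r"\b([A-Z][A-Za-z ]{1,30}?)\s*:\s*([^:/\n]{1,40})")

def extract_candidate_names(descriptions, min_sellers=50, min_items=500):
    """descriptions: iterable of (seller_id, item_id, text) triples."""
    sellers, items = defaultdict(set), defaultdict(set)
    for seller_id, item_id, text in descriptions:
        for name, _value in PAIR.findall(text):
            key = name.strip().lower()
            sellers[key].add(seller_id)
            items[key].add(item_id)
    # Keep names attested by many distinct sellers across many items.
    return {n for n in sellers
            if len(sellers[n]) >= min_sellers and len(items[n]) >= min_items}
```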
