Similarity encoding for learning on dirty categorical variables



  1. Similarity encoding for learning on dirty categorical variables. Gaël Varoquaux, Scikit-learn project lead. Agenda today: bring to light a problem; show that statistical learning can solve it.

  2. Machine learning: let $X \in \mathbb{R}^{n \times p}$. G Varoquaux 2

  3. Machine learning: let $X \in \mathbb{R}^{n \times p}$. The data:

     Gender  Date Hired  Employee Position Title
     M       09/12/1988  Master Police Officer
     F       11/19/1989  Social Worker IV
     M       07/16/2007  Police Officer III
     F       02/05/2007  Police Aide
     M       01/13/2014  Electrician I
     M       04/28/2002  Bus Operator
     M       03/02/2008  Bus Operator
     F       06/26/2006  Social Worker III
     F       01/26/2000  Library Assistant I
     M       11/22/2010  Library Assistant I

  4. Machine learning: let $X \in \mathbb{R}^{n \times p}$. The data: the same employee table (Gender, Date Hired, Employee Position Title) as on slide 3. A data cleaning problem? A feature engineering problem?

  5. The problem of “dirty categories”: non-curated categorical entries.
     - Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
     - High cardinality: 400 unique entries in 10 000 rows
     - Rare categories: only 1 “Architect III”
     - New categories in test set
     Example column, Employee Position Title: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I.
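The symptoms listed on this slide can be checked directly on a column of strings. The following is a minimal sketch using only the Python standard library; the function name `category_report` and the three-row test split are hypothetical, chosen for illustration, while the training titles are the ten shown on the slide.

```python
from collections import Counter

# Position titles copied from the slide's example column.
train_titles = [
    "Master Police Officer", "Social Worker IV", "Police Officer III",
    "Police Aide", "Electrician I", "Bus Operator", "Bus Operator",
    "Social Worker III", "Library Assistant I", "Library Assistant I",
]
# Hypothetical held-out rows, invented for illustration.
test_titles = ["Bus Operator", "Police Officer II", "Architect III"]


def category_report(train, test):
    """Summarize cardinality, rare categories, and categories unseen in training."""
    counts = Counter(train)
    return {
        "n_rows": len(train),
        "n_unique": len(counts),  # cardinality of the column
        "singletons": sorted(c for c, k in counts.items() if k == 1),  # rare categories
        "unseen_in_test": sorted(set(test) - set(counts)),  # new categories in test set
    }


report = category_report(train_titles, test_titles)
print(report)
```

Even on ten rows, most titles occur only once, and two of the three held-out titles never appear in training.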

  6. Dirty categories in the wild. Employee Salaries: salary information for employees of Montgomery County, Maryland. Example column, Employee Position Title: Master Police Officer, Social Worker IV, ...

  7. Dirty categories in the wild.
     - Employee Salaries: salary information for employees of Montgomery County, Maryland.
     - Open Payments: payments by health care companies to medical doctors or hospitals.

     Company name                          Frequency
     Pfizer Inc.                           79,073
     Pfizer Pharmaceuticals LLC            486
     Pfizer International LLC              425
     Pfizer Limited                        13
     Pfizer Corporation Hong Kong Limited  4
     Pfizer Pharmaceuticals Korea Limited  3
     ...

  8. Dirty categories in the wild.
     - Employee Salaries: salary information for employees of Montgomery County, Maryland.
     - Open Payments: payments by health care companies to medical doctors or hospitals.
     - Medical charges: patient discharges (utilization, payment, and hospital-specific charges) across 3 000 US hospitals.
     - ...
     Nothing on the UCI machine-learning data repository.

  9. Dirty categories in the wild. [Figure: number of categories versus number of rows, both on logarithmic axes, for beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, and medical charges, with reference curves 100 √n and 5 log2(n).]

  10. Mechanisms creating dirty categories: typos; open-ended entries; merging different data sources.

  11. Our goal: a statistical view of supervised learning on dirty categories. The statistical question should inform curation: “Pfizer Corporation Hong Kong” = “Pfizer Pharmaceuticals Korea”? Rest of the talk: 1. Related approaches; 2. Similarity encoding; 3. Empirical study.

  12. 1 Related approaches: database cleaning; natural language processing; machine learning.

  13. 1 A database cleaning point of view: recognizing / merging entities.
      - Record linkage: matching across different (clean) tables
      - Deduplication / fuzzy matching: matching in one dirty table
      Techniques [Fellegi and Sunter 1969]:
      - Supervised learning (known matches)
      - Clustering
      - Expectation Maximization to learn a metric
      Outputs a “clean” database.

  14. 1 A natural language processing point of view.
      - Stemming / normalization: set of (handcrafted) rules; needs to be adapted to new languages / new domains

  15. 1 A natural language processing point of view.
      - Stemming / normalization: set of (handcrafted) rules; needs to be adapted to new languages / new domains
      - Semantics: relate different discrete objects. Formal semantics (entity resolution in knowledge bases). Distributional semantics: “a word is characterized by the company it keeps”

  16. 1 A natural language processing point of view.
      - Stemming / normalization: set of (handcrafted) rules; needs to be adapted to new languages / new domains
      - Semantics: relate different discrete objects. Formal semantics (entity resolution in knowledge bases). Distributional semantics: “a word is characterized by the company it keeps”
      - Character-level NLP: for entity resolution [Klein... 2003]; for semantics [Bojanowski... 2017]
      “London” & “Londres” may carry different information.

  17. 1 A machine-learning point of view. High-cardinality categorical data: encoding each category blows up the dimension.
      Target encoding [Micci-Barreca 2001]:
      - Represent each category by a simple statistical link to the target y, e.g. $\mathbb{E}[y \mid X_i = C_k]$
      - 1D real-number embedding for a categorical column
      - Brings close categories with the same link to y
      - Great for tree-based machine learning [Dorogush...]

  18. 1 A machine-learning point of view. High-cardinality categorical data: encoding each category blows up the dimension.
      Target encoding [Micci-Barreca 2001]:
      - Represent each category by a simple statistical link to the target y, e.g. $\mathbb{E}[y \mid X_i = C_k]$
      - 1D real-number embedding for a categorical column
      - Brings close categories with the same link to y
      - Great for tree-based machine learning [Dorogush...]
      But fails on unseen categories.
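To make the idea of target encoding concrete, here is a minimal sketch in plain Python that replaces each category by the empirical mean of y over the training rows with that category. The function names, the toy salaries, and the choice to fall back to the global mean for unseen categories are all assumptions made for illustration; the slide does not specify how unseen categories are handled, only that the approach fails on them.

```python
from collections import defaultdict


def fit_target_encoding(categories, y):
    """Map each training category C_k to the empirical mean of y given X_i = C_k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    encoding = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(y) / len(y)
    return encoding, global_mean


def transform_target_encoding(categories, encoding, fallback):
    """Encode each category as one real number; unseen ones get the fallback."""
    return [encoding.get(c, fallback) for c in categories]


# Hypothetical toy data (not from the talk): job titles and salaries.
titles = ["Bus Operator", "Bus Operator", "Police Aide", "Police Officer III"]
salaries = [40.0, 44.0, 35.0, 70.0]

encoding, global_mean = fit_target_encoding(titles, salaries)
encoded = transform_target_encoding(
    ["Bus Operator", "Police Officer II"], encoding, global_mean
)
print(encoded)
```

In this sketch “Police Officer II” receives the same fallback value as any other unseen string, even though it closely resembles the known “Police Officer III”. That loss of information is the weakness the slide points at.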

  19. 2 Similarity encoding [P. Cerda, G. Varoquaux, & B. Kegl, Machine Learning 2018]

  20. 2 Similarity encoding [P. Cerda, G. Varoquaux, & B. Kegl, Machine Learning 2018]
      1. One-hot encoding maps categories to vector spaces
      2. String similarities capture information

  21. 2 Adding similarities to one-hot encoding. One-hot encoding, $X \in \mathbb{R}^{n \times p}$:

               London  Londres  Paris
      Londres  0       1        0
      London   1       0        0
      Paris    0       0        1

      p grows fast. New categories? Link categories?

  22. 2 Adding similarities to one-hot encoding. One-hot encoding, $X \in \mathbb{R}^{n \times p}$:

               London  Londres  Paris
      Londres  0       1        0
      London   1       0        0
      Paris    0       0        1

      p grows fast. New categories? Link categories?

      Similarity encoding:

               London  Londres  Paris
      Londres  0.3     1.0      0.0
      London   1.0     0.3      0.0
      Paris    0.0     0.0      1.0

      The off-diagonal 0.3 is given by the string distance (Londres, London).
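The contrast between the two tables can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the similarity used here is `difflib.SequenceMatcher.ratio()` from the standard library, a stand-in that will not reproduce the 0.3 of the slide, and the misspelled entry “Londonn” is a hypothetical unseen category.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in string similarity in [0, 1] (difflib ratio, not the talk's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def one_hot_encode(values, vocabulary):
    """One column per known category: 1.0 on an exact match, 0.0 elsewhere."""
    return [[1.0 if v == ref else 0.0 for ref in vocabulary] for v in values]


def similarity_encode(values, vocabulary, sim=similarity):
    """One column per known category, filled with the similarity to that category."""
    return [[sim(v, ref) for ref in vocabulary] for v in values]


vocabulary = ["London", "Londres", "Paris"]
# The last entry is a hypothetical typo that is absent from the vocabulary.
rows = ["Londres", "London", "Paris", "Londonn"]

one_hot = one_hot_encode(rows, vocabulary)
sim_encoded = similarity_encode(rows, vocabulary)

for value, oh, se in zip(rows, one_hot, sim_encoded):
    print(value, oh, [round(x, 2) for x in se])
```

The unseen “Londonn” maps to an all-zero row under one-hot encoding, whereas similarity encoding still places it close to “London”. The number of columns stays the same in both cases.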

  23. 2 Some string similarities
      - Levenshtein: number of edit operations on one string to match the other
      - Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$, where m is the number of matching characters and t the number of character transpositions
      - n-gram similarity: an n-gram is a group of n consecutive characters; similarity = (# n-grams in common) / (# n-grams in total)
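Two of these measures are short enough to sketch in plain Python. The code below is an illustrative implementation, not the one used in the paper: the Levenshtein function counts insertions, deletions, and substitutions with unit cost, and the n-gram similarity assumes that “in common” and “in total” refer to sets of distinct n-grams, without any padding of the strings. Jaro-Winkler is left out to keep the sketch short.

```python
def levenshtein(s1, s2):
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def ngrams(s, n=3):
    """Set of distinct n-grams: groups of n consecutive characters."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(s1, s2, n=3):
    """Shared n-grams divided by all distinct n-grams found in either string."""
    a, b = ngrams(s1, n), ngrams(s2, n)
    total = a | b
    return len(a & b) / len(total) if total else 0.0


print(levenshtein("London", "Londres"))
print(ngram_similarity("London", "Londres"))
```

For “London” and “Londres” the edit distance is 3, and the two strings share 2 of 7 distinct 3-grams (“Lon” and “ond”), giving a similarity of 2/7, about 0.29.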

  24. 3 Empirical study

  25. 3 Datasets with dirty categories

      Dataset             # of rows  # of categories  Count of least frequent category  Prediction type
      medical charges     160k       100              613                               regression
      employee salaries   9.2k       385              1                                 regression
      open payments       100k       973              1                                 binary clf
      midwest survey      2.8k       1009             1                                 multiclass clf
      traffic violations  100k       3043             1                                 multiclass clf
      road safety         10k        4617             1                                 binary clf
      beer reviews        10k        4634             1                                 multiclass clf

      7 datasets! All open.

  26. 3 Experiments: cross-validation & measure prediction. Focus on prediction rather than in-sample statistics: easier non-parametric evaluation; amenable to high dimension.

  27. 3 Results: gradient boosted trees. Average ranking across datasets:

      Encoding                                Average rank (lower is better)
      Hash encoding                           5.9
      One-hot encoding                        4.6
      Target encoding                         3.7
      Similarity encoding, Jaro-Winkler       2.9
      Similarity encoding, Levenshtein ratio  2.4
      Similarity encoding, 3-gram             1.6

      [Figure: per-dataset prediction scores for medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, and beer reviews.]
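An average ranking of this kind can be computed by ranking the methods within each dataset and averaging those ranks per method. The sketch below shows one plausible way to do it; it assumes higher scores are better, that tied methods share the mean of their ranks, and that every method was run on every dataset. The scores are placeholders invented for illustration and are not the talk's results.

```python
def rank_scores(scores):
    """Rank methods on one dataset: 1 = highest score; ties share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks


def average_ranking(results):
    """Average each method's rank over datasets; results[dataset][method] = score."""
    totals = {}
    for scores in results.values():
        for name, rank in rank_scores(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(results) for name, total in totals.items()}


# Placeholder scores invented for illustration; they are NOT the talk's results.
toy_results = {
    "dataset_a": {"one-hot": 0.70, "target": 0.72, "3-gram": 0.75},
    "dataset_b": {"one-hot": 0.60, "target": 0.58, "3-gram": 0.66},
}
avg = average_ranking(toy_results)
print(avg)
```

Ranks discard the scale of the scores, which makes methods comparable across datasets whose metrics differ (regression versus classification, for instance).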

