ONLINE MACHINE LEARNING AND DATA MINING EDO LIBERTY
STANDARD MACHINE LEARNING SETTING
[Diagram: Training Data (examples labeled “call” and “crawl”), Train, Model; Test Data (example labeled “?”), Apply, Label “call”]
STANDARD MACHINE LEARNING SETTING
- Predicting the future is impossible (in general)
- Big ML means optimization on big data
- Data is generated by a stochastic process
- More training data is better
MORE DATA IS OFTEN WORSE (MORE DATA = OLDER DATA)
OUR ACTIONS HEAVILY INFLUENCE THE DATA
THE FUTURE IS OFTEN NOT LIKE THE PAST! Same story line or not? 1) The answer depends on the future 2) We have to decide now…
HAVING “A MODEL” IS COMPLETELY UNIMPORTANT
Elements of information theory, Cover, 1991
Efficient algorithms for universal portfolios, Kalai, Vempala, 2003
Efficient Algorithms for Online Game Playing and Universal Portfolio Management, Agarwal, Hazan, 2006
ONLINE ALGORITHMS (DECISION MAKING WITHOUT PREDICTING)
THE SKI RENTAL PROBLEM Rent: $x/day Buy: $1000
THE SKI RENTAL PROBLEM Prices: 70 | Decisions: R | Cost: 70
THE SKI RENTAL PROBLEM Prices: 70 90 | Decisions: R R | Cost: 70 + 90 = 160
THE SKI RENTAL PROBLEM Prices: 70 90 80 | Decisions: R R B | Cost: 70 + 90 + 1000 = 1160
THE SKI RENTAL PROBLEM Prices: 70 90 80 70 | Decisions: R R B | Cost: 70 + 90 + 1000 + 0 = 1160
THE SKI RENTAL PROBLEM Prices: 70 90 80 70 | Decisions: R R R R | Cost: 70 + 90 + 80 + 70 = 310. You should have rented all along…
THE SKI RENTAL PROBLEM Input: 70 90 80 70 90 88 72 79 | Computation: $1000 $1000 | Output: R R R R B
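THE SKI RENTAL PROBLEM ALG <= 2 OPT, where ALG is the cost paid by the online algorithm and OPT is the cost of the optimal decision in hindsight. [Diagram labels: Algorithm, Buy, Optimal in hindsight]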
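The ALG <= 2 OPT guarantee can be illustrated with the standard break-even rule: keep renting while the total rent paid would stay within the purchase price, and buy the first time it would not. The slides do not spell out the rule, so the sketch below is an assumption about the intended algorithm; the function names `ski_rental` and `optimal_in_hindsight` are hypothetical.

```python
def ski_rental(prices, buy_price):
    """Break-even rule (assumed): rent while cumulative rent stays <= buy_price."""
    rent_paid = 0
    total_cost = 0
    owned = False
    decisions = []
    for price in prices:
        if owned:
            decisions.append("-")        # already own skis, nothing to pay
        elif rent_paid + price > buy_price:
            owned = True                 # renting again would exceed the purchase price
            total_cost += buy_price
            decisions.append("B")
        else:
            rent_paid += price
            total_cost += price
            decisions.append("R")
    return decisions, total_cost


def optimal_in_hindsight(prices, buy_price):
    """With all prices known in advance: rent every day or buy on day one."""
    return min(sum(prices), buy_price)


if __name__ == "__main__":
    for prices in ([70, 90, 80, 70], [400, 400, 400, 400]):
        decisions, alg = ski_rental(prices, 1000)
        opt = optimal_in_hindsight(prices, 1000)
        print(prices, decisions, "ALG =", alg, "OPT =", opt)
```

If the rule never buys, it pays exactly the total rent, which is also optimal. If it buys, it has paid at most the purchase price in rent plus the purchase price itself, while the optimum is the purchase price, hence the factor of 2.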
ONLINE LINEAR CLASSIFICATION
ONLINE MACHINE LEARNING [Diagram: Emails, Computation, Spam? Outputs: N]
ONLINE MACHINE LEARNING [Diagram: Emails, Computation, Spam? Outputs: N N]
ONLINE MACHINE LEARNING [Diagram: Emails, Computation, Spam? Outputs: N N Y]
ONLINE MACHINE LEARNING [Diagram: Computation. Outputs: N N Y N Y N N Y]
- Number of mistakes is compared to the best classifier in hindsight!
- Variants of SGD have this property
Prediction, Learning, and Games, Cesa-Bianchi, Lugosi, 2006
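As one concrete instance of a mistake-driven online linear classifier, the sketch below uses the classic perceptron update, which predicts each label before seeing it and changes the weights only after a mistake. The slides mention only "variants of SGD", so the choice of the perceptron, the +1/-1 label encoding and the function name `online_perceptron` are assumptions made for illustration.

```python
def online_perceptron(stream, dim):
    """Process (features, label) pairs one at a time; labels are +1 or -1."""
    w = [0.0] * dim
    mistakes = 0
    predictions = []
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score > 0 else -1            # predict before the label is revealed
        predictions.append(y_hat)
        if y_hat != y:                            # update only on a mistake
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes, predictions


if __name__ == "__main__":
    # Toy stream: first feature is a constant bias term, label is the sign of the second.
    stream = [((1.0, 2.0), 1), ((1.0, -2.0), -1)] * 10
    w, mistakes, predictions = online_perceptron(stream, dim=2)
    print("weights:", w, "mistakes:", mistakes)
```

The quantity of interest is `mistakes`, which online learning theory compares against the number of mistakes made by the best fixed classifier chosen in hindsight.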
ONLINE PRINCIPAL COMPONENT ANALYSIS
Online Principal Components Analysis, Boutsidis, Garber, Karnin, Liberty 2014
Online PCA with Spectral Bounds, Karnin, Liberty, 2015
[Figure: a point x_i, its projection ΦΦᵀx_i, and the residual ‖x_i − ΦᵀΦx_i‖]
Eigenpets: https://bioramble.wordpress.com/2015/09/01/
ONLINE PRINCIPAL COMPONENT ANALYSIS Online PCA with Spectral Bounds, Karnin, Liberty, 2015
ONLINE K-MEANS CLUSTERING
An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko, 2014
K-MEANS CLUSTERING
http://en.wikipedia.org/wiki/MNIST_database
http://research.ics.aalto.fi/mi/software/ne/
K-MEANS CLUSTERING
- Roughly 20,000 documents
- 20 topics:
  - Graphics
  - PC hardware
  - Baseball
  - For-sale
  - Politics
  - …
http://qwone.com/~jason/20Newsgroups/
http://research.ics.aalto.fi/mi/software/ne/
K-MEANS CLUSTERING
1) One can cluster points fully online
2) Create only slightly more than k centers
3) Be competitive with the best offline clustering to k clusters
An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko 2015
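To make "fully online" concrete, here is a simplified sketch in the spirit of online facility location: each arriving point either joins its nearest center or opens a new center with probability proportional to its squared distance, and the cost of opening a center doubles after every k openings. This is not the exact algorithm from the cited paper; the doubling schedule, the `initial_cost` parameter and the function names are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_kmeans(points, k, initial_cost=1.0, seed=0):
    """Assign each point to a cluster as it arrives; assignments are never revised."""
    rng = random.Random(seed)
    centers, labels = [], []
    cost = initial_cost          # current "price" of opening a new center
    opened_in_phase = 0
    for p in points:
        if not centers:
            centers.append(tuple(p))
            labels.append(0)
            continue
        d2, nearest = min((sq_dist(p, c), j) for j, c in enumerate(centers))
        if rng.random() < min(d2 / cost, 1.0):
            centers.append(tuple(p))             # far points are likely to open a center
            labels.append(len(centers) - 1)
            opened_in_phase += 1
            if opened_in_phase >= k:             # too many openings: make them pricier
                cost *= 2.0
                opened_in_phase = 0
        else:
            labels.append(nearest)
    return centers, labels


if __name__ == "__main__":
    data_rng = random.Random(1)
    pts = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
           for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(200)]
    data_rng.shuffle(pts)
    centers, labels = online_kmeans(pts, k=3)
    print("centers opened:", len(centers))
```

The number of centers opened is typically larger than k, which matches the slide's framing that the online algorithm is allowed slightly more than k centers.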
ONLINE K-MEANS CLUSTERING
[Plot: legend lists the datasets 20news-binary, adult, ijcnn1, letter, magic04, maptaskcoref, nomao, poker, shuttle.binary, skin, vehv2binary, w8all; both axes run from 0 to 1.2]
An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko 2015
k-means++: the advantages of careful seeding, Arthur, Vassilvitskii, 2006
STREAMING ALGORITHMS OPEN SOURCE FROM YAHOO EDO LIBERTY
DATASKETCHES.GITHUB.IO
[Diagram: The World, Data, Computation, Result]
DISTRIBUTED STORAGE [Diagram: The World; Data stored in several partitions; Computation; Result]
DISTRIBUTED MODEL (MAP/REDUCE, MESSAGE PASSING, …) [Diagram: The World; several Data + Compute nodes; Computation; Result]
DISTRIBUTED MODEL (INDEXES, TABLES, DATABASES, …) [Diagram: The World; several Data + Compute nodes; Query; Computation; Result]
BIG-DATA META INFOGRAPHIC
THE STREAMING COMPUTATIONAL MODEL [Diagram: The World, Sketch, Result; Query, Result]
THE STREAMING COMPUTATIONAL MODEL [Diagram: stream 1 7 8 1 0 1 7 7 of O(n) items; Iterator; Computation in O(polylog(n)) space; Sketch; Query]
THE DISTRIBUTED STREAMING COMPUTATIONAL MODEL [Diagram: The World; several Sketches; Merge; Sketch]
Number of users (easy) [Diagram: data; Map (count); Reduce (sum)]
Web Site Logs
Time    User ID   Site    Time Spent (Sec)   Items Viewed
9:00    U1        Apps    59                 5
9:30    U2        Apps    179                15
10:00   U3        Music   29                 3
1:00    U1        Music   89                 10
…       …         …       …                  …

Financial Transactions System Log
Time    User ID   Site    Purchased     Revenue
9:00    U1        Apps    FaceTune      $3.99
9:30    U2        Apps    Minecraft     $6.99
10:00   U3        Music   Purple Rain   $1.29
10:05   U3        Apps    Minecraft     $6.99
…       …         …       …             …

Unique User Queries
• Unique users viewing Apps since 9:45…?
• Unique users visiting Apps site AND Music site?
• Unique users visiting Apps site AND NOT Music site?
Frequency Queries
• The numbers of times each app was purchased
Join Queries
• For all users that purchased Apps, what is the average / median time spent?
Quantile Queries
• The median and 95%ile Time Spent seconds by ...?
• A Frequency Histogram of Time Spent by Split-Points specified at query time?
Number of unique users (hard) [Diagram: data; Map (key=user); Reduce (return 1); Reduce (sum)]
Number of unique users (made easy) [Diagram: data; Map (sketch); Reduce (merge)]
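The "Map (sketch), Reduce (merge)" pattern can be illustrated with a K-Minimum-Values (KMV) sketch: hash every item to a number in [0, 1), keep only the k smallest hash values, and estimate the distinct count from the k-th smallest. This is a minimal teaching sketch, not the DataSketches library's Theta or HLL implementation; the class name `KMVSketch`, its methods, and the use of SHA-256 as the hash are assumptions made for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; illustrative, not production code."""

    def __init__(self, k=1024):
        self.k = k
        self.values = set()

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform number in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def update(self, item):
        h = self._hash(item)
        if h in self.values:
            return                               # duplicate item: nothing changes
        if len(self.values) < self.k:
            self.values.add(h)
            return
        largest = max(self.values)               # O(k) scan; a heap would be faster
        if h < largest:
            self.values.remove(largest)
            self.values.add(h)

    def merge(self, other):
        # The k smallest values of the union summarize the union of both streams.
        merged = KMVSketch(min(self.k, other.k))
        merged.values = set(heapq.nsmallest(merged.k, self.values | other.values))
        return merged

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))       # fewer than k distinct items: exact
        return (self.k - 1) / max(self.values)


if __name__ == "__main__":
    # "Map": one sketch per data partition. "Reduce": merge the sketches.
    part_a, part_b = KMVSketch(), KMVSketch()
    for i in range(0, 10000):
        part_a.update("user%d" % i)
    for i in range(5000, 15000):
        part_b.update("user%d" % i)
    print("estimated unique users:", round(part_a.merge(part_b).estimate()))
```

Because merging two sketches gives the same state as sketching the combined stream, the reduce step needs only the small sketches and never the raw data.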
Current Sketch Implementations
Count Unique Sketches
– Both Theta Sketches* and HLL Sketches
– Estimating Cardinality of a stream of identifiers with duplicates
– Set Operations (e.g., Union, Intersection, and Difference)
– Can be extended to produce approximate Joins
Quantiles Sketches
– Normal or Inverse PMFs, CDFs of streams of numeric values, using after-the-fact queries.
Frequent Item Sketches
– Identify the Heavy Hitters of arbitrary objects from a stream of objects
– Estimate the frequency of any item from the stream
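For the frequent-items case, the classic Misra-Gries summary gives a feel for how heavy hitters can be tracked in small memory: it keeps at most k - 1 counters, and each reported count undershoots the true count by at most n/k for a stream of n items. The sketch below is that textbook algorithm, not the DataSketches implementation; the function names are hypothetical.

```python
def misra_gries(stream, k):
    """Return a dict of at most k - 1 counters summarizing the stream."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def estimate_frequency(counters, item):
    """Lower-bound estimate of how often item appeared (0 if not tracked)."""
    return counters.get(item, 0)


if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 30 + ["c%d" % i for i in range(20)]
    summary = misra_gries(stream, k=5)
    print(sorted(summary.items(), key=lambda kv: -kv[1]))
```

Any item that occurs more than n/k times is guaranteed to survive in the summary, which is what makes the heavy hitters identifiable after a single pass.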
DataSketches.GitHub.io Open Source Library
• Dedicated to production quality Sketch implementations.
  – These are not toy algorithms!
  – Heavily used within Yahoo
• Common Attributes
  – True streaming. Single pass, “one-touch” algorithms for either real-time or batch
  – All Sketches are Mergeable, which makes them highly parallelizable.
  – Designed for multiple large-scale computing environments:
    • Core of library is coded in Java with no external dependencies
    • Easy integration into virtually any system environment
    • Adaptors for Hadoop/Pig and Hadoop/Hive environments
    • Standard library promotes sharing across platforms and organizations
  – Maven deployable and registered with Maven Central Repository
    • http://search.maven.org/#search|ga|1|datasketches
  – Comprehensive unit tests and testing tools are provided
  – Extensive documentation with Systems Developers in mind
  – All algorithms are backed by published mathematical theory
Counting distinct elements example
10M sender domains from inbound emails
$ less emails.csv | wc -l
10000000
$ head -n 5 emails.csv
facebookmail.com
jobsdbalert.co.id
facebookmail.com
twitter.com
bonsplansdujour.net
There are duplicates
$ cat emails.csv | sort | uniq | wc -l
^C
Roughly 200 MB and several minutes of CPU (~25 seconds for numbers)
$ cat emails.csv | sort -u -S 100% | wc -l
^C
$ cat emails.csv | sketch uniq
47618 40772 55589
< 10 KB of memory and 1.5 seconds!
$ cat emails.csv | sketch uniq 0.01
53782 53351 54216