Data Stream Classification using Random Feature Functions and Novel Method Combinations
Jesse Read (1), Albert Bifet (2)
(1) Department of Information and Computer Science, Aalto University and HIIT, Finland
(2) Huawei Noah's Ark Lab, Hong Kong
August 21, 2015
Classification in Data Streams

Setting:
- sequence is potentially infinite
- high speed of arrival
- stream is one-way, can't 'go back'

Implications:
- memory is limited
- adapt to concept drift
Methods for Data Streams
- Naive Bayes
- Stochastic Gradient Descent
- Incremental Decision Trees (Hoeffding Tree)
- Lazy/prototype methods (e.g., kNN)
- Batch-learning (e.g., SVMs, decision trees)
Stochastic Gradient Descent (SGD)
- update weights incrementally
[Figure: single-layer linear model, inputs x1, x2, x3 mapped to output y]
Drawbacks:
- relatively poor performance
- fiddly hyperparameters, particularly with multiple layers
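The per-instance update can be written in a few lines. Below is a minimal sketch (not the authors' MOA code) of incremental SGD for a binary linear model with logistic loss; the learning rate lr is exactly the kind of hyperparameter that needs tuning.

```python
import numpy as np

class SGDLearner:
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)  # weight vector, updated per instance
        self.b = 0.0                   # bias term
        self.lr = lr                   # learning rate: one of the 'fiddly' hyperparameters

    def predict_proba(self, x):
        # logistic output in (0, 1)
        return 1.0 / (1.0 + np.exp(-(self.w @ np.asarray(x, dtype=float) + self.b)))

    def update(self, x, y):
        # one gradient step on the logistic loss for a single instance, y in {0, 1}
        x = np.asarray(x, dtype=float)
        grad = self.predict_proba(x) - y
        self.w -= self.lr * grad * x
        self.b -= self.lr * grad
```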
Hoeffding Trees
Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on
[Figure: decision tree splitting on x1 ≤ 0.3, then x3 ≤ −2.9 and x2 = A/B, with class labels y at the leaves]
Drawbacks:
- tree may grow conservatively, although using Naive Bayes at the leaves mitigates this
- need to start again when the concept changes
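To make the split rule concrete, here is a minimal sketch of the Hoeffding-bound test, assuming a split-measure score (e.g., information gain) has already been accumulated per attribute at a leaf; the full VFDT bookkeeping of sufficient statistics is omitted.

```python
import math

def hoeffding_bound(R, delta, n):
    # the true mean lies within eps of the observed mean with probability 1 - delta,
    # after n observations; R is the range of the split measure
    # (e.g., log2(#classes) for information gain)
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, n, R=1.0, delta=1e-7):
    # split when the best attribute's gain beats the runner-up by more than eps
    best, second = sorted(gains, reverse=True)[:2]
    return (best - second) > hoeffding_bound(R, delta, n)
```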
k-Nearest Neighbours (kNN)
- maintain a dynamic buffer of instances, compare each test instance to these
[Figure: scatter plot of buffered instances c1–c6 in the (x1, x2) plane, with a query point '?']
Advantages:
- automatically adapts to concept drift
Drawbacks:
- limited buffer size, sensitive to number of attributes
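A minimal sketch of the buffer-based approach, assuming a fixed-size sliding window and Euclidean distance (the actual buffer-management policy can be more elaborate):

```python
from collections import deque, Counter
import numpy as np

class WindowedKNN:
    def __init__(self, k=5, window=1000):
        self.k = k
        self.buffer = deque(maxlen=window)  # oldest instances drop out automatically

    def update(self, x, y):
        self.buffer.append((np.asarray(x, dtype=float), y))

    def predict(self, x):
        if not self.buffer:
            return None
        x = np.asarray(x, dtype=float)
        # majority vote among the k nearest buffered instances
        dists = sorted(((np.linalg.norm(x - xi), yi) for xi, yi in self.buffer),
                       key=lambda t: t[0])
        return Counter(yi for _, yi in dists[:self.k]).most_common(1)[0][0]
```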
Preliminary Comparison

Table: Results (accuracy) under prequential evaluation

  Dataset      #Att.  #Ins.       HT     SGD    kNN    LB-HT
  Electricity      8  45,312      79.2   57.6   78.4   89.8
  CoverType       54  581,012     80.3   60.7   92.2   91.7
  SUSY             8  5,000,000   78.2   76.5   67.5   78.7

- kNN performs relatively poorly on larger streams
- Hoeffding Tree works well, particularly in the Leveraging Bagging ensemble (LB-HT)
- SGD performs poorly overall, less so on larger streams
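Prequential (test-then-train) evaluation, as used for the table above, can be sketched as follows; `stream` and `model` are hypothetical objects exposing a predict/update interface:

```python
def prequential_accuracy(stream, model):
    """Test-then-train: each instance is predicted first, then used for an update."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test on the current model first...
        total += 1
        model.update(x, y)                     # ...then train on the same instance
    return correct / total if total else 0.0
```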
A Modular View of Data Stream Classification
[Figure: layered architecture mapping inputs x1..x5 through intermediate feature layers z and models T to outputs y]
e.g., x → filter → ensemble of HT → filter → SGD → y
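A minimal sketch of this modular view, in which filters transform instances and a final classifier consumes them, all behind a common interface so components can be chained freely (the class names are illustrative, not MOA's actual API):

```python
class StreamPipeline:
    """Chain zero or more filters (transform only) with a final stream classifier."""
    def __init__(self, filters, classifier):
        self.filters = filters        # e.g., [RandomProjectionFilter(...)]
        self.classifier = classifier  # e.g., a Hoeffding tree or SGD learner

    def _transform(self, x):
        for f in self.filters:
            x = f.transform(x)
        return x

    def predict(self, x):
        return self.classifier.predict(self._transform(x))

    def update(self, x, y):
        self.classifier.update(self._transform(x), y)
```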
Random Projections as a Filter
- Back-propagation often too fiddly for data streams
- Can do a random projection instead
[Figure: example of data randomly projected into a new feature space]
- Not a new idea, but fits nicely into MOA as a filter
- Can be used with any classifier
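A minimal sketch of such a filter: project the d input attributes through a fixed random matrix and a nonlinearity to h random features, with no back-propagation; any stream classifier can then be trained on the output (e.g., SGD_f = filter + SGD). The sigmoid here is purely illustrative.

```python
import numpy as np

class RandomProjectionFilter:
    def __init__(self, d, h, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(h, d))  # fixed random weights, never trained
        self.b = rng.normal(size=h)       # fixed random biases

    def transform(self, x):
        # nonlinear random features of the input instance
        z = self.W @ np.asarray(x, dtype=float) + self.b
        return 1.0 / (1.0 + np.exp(-z))
```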
Results: Random Projections
[Figure: accuracy vs. h/d (10^0 to 10^3) on ELEC and SUSY, for HT, HT_f, kNN, kNN_f, SGD, SGD_f]
- SGD responds best to a randomly projected input space
- For kNN, it depends on the stream
- No advantage for HT; however, one can put, for example, kNN or SGD_f ('filtered' SGD) in the leaves of HT
Figure: SUSY, first 50,000 examples, in 100 windows (accuracy per window for HT, kNN, kNN_f, SGD, SGD_f, HT-SGD, HT-SGD_f, HT-kNN, LB-HT, LB-SGD_f)
Figure: SUSY, 5,000,000 examples, in 100 windows (accuracy per window for the same methods)
- SGD_f works well on larger data streams
- ...also at the leaves of a Hoeffding Tree (HT-SGD_f)
- filter helps kNN (but only worth it on smaller streams)
Other results (see paper for details):
- HT-kNN (kNN in the leaves) often outperforms the benchmark HT (-NB) and is competitive with LB-HT (tied best overall), which has an ensemble of 10 HTs

Table: Running time (s) on SUSY over 5,000,000 examples

  Method     Time (s)
  SGD              25
  SGD_f           118
  kNN            1464
  kNN_f          4714
  HT               45
  HT-SGD_f        159
  HT-kNN         1428
  LB-HT           530
  LB-SGD_f       1040
Summary
- A modular view of data stream classification allows a large number of method combinations to be trialled
- New competitive combinations were found
- Random feature functions can work well in a number of different contexts
- No extra parameter calibration is necessary for these methods
- Future work: prune and replace nodes over time