  1. Data Stream Classification using Random Feature Functions and Novel Method Combinations
  Jesse Read (1), Albert Bifet (2)
  (1) Department of Information and Computer Science, Aalto University and HIIT, Finland
  (2) Huawei Noah's Ark Lab, Hong Kong
  August 21, 2015

  2. Classification in Data Streams
  Setting:
  - the sequence is potentially infinite
  - high speed of arrival
  - the stream is one-way: we can't 'go back'
  Implications:
  - memory is limited
  - must adapt to concept drift

  3. Classification in Data Streams

  4. Methods for Data Streams
  - Naive Bayes
  - Stochastic Gradient Descent
  - incremental decision trees (Hoeffding Tree)
  - lazy/prototype methods (e.g., kNN)
  - batch learning (e.g., SVMs, decision trees)

  5. Stochastic Gradient Descent (SGD)
  - update weights incrementally
  [diagram: linear model mapping inputs x1, x2, x3 to output y]

  6. Stochastic Gradient Descent (SGD)
  - update weights incrementally (see the sketch below)
  [diagram: linear model mapping inputs x1, x2, x3 to output y]
  - relatively poor performance
  - fiddly hyperparameters, particularly with multiple layers
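To make the incremental update concrete, here is a minimal sketch of single-instance SGD for a binary linear model with logistic loss. The class and method names are ours for illustration, not MOA's API, and the learning rate `lr` is exactly the kind of fiddly hyperparameter the slide warns about.

```python
import numpy as np

class SGDSketch:
    """Minimal binary linear classifier trained by SGD with logistic
    loss -- a sketch of the slide's idea, not the MOA implementation."""

    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr  # a fiddly hyperparameter: too high diverges, too low stalls

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-(self.w @ np.asarray(x) + self.b)))

    def predict(self, x):
        return int(self.predict_proba(x) >= 0.5)

    def learn_one(self, x, y):
        # gradient step on the logistic loss for one instance, y in {0, 1}
        err = self.predict_proba(x) - y
        self.w -= self.lr * err * np.asarray(x)
        self.b -= self.lr * err
```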

  7. Hoeffding Trees
  Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on
  [diagram: example tree splitting on x1 ≤ 0.3, then x3 ≤ -2.9 and x2 = A/B, with class leaves y]

  8. Hoeffding Trees
  Very Fast Decision Trees: obtain, incrementally, a certain level of confidence (Hoeffding bound) on which attribute to split on (see the sketch below)
  [diagram: example tree splitting on x1 ≤ 0.3, then x3 ≤ -2.9 and x2 = A/B, with class leaves y]
  - the tree may grow conservatively, although using Naive Bayes at the leaves mitigates this
  - need to start again when the concept changes
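The split decision rests on the Hoeffding bound. A minimal sketch, assuming the usual VFDT formulation where R is the range of the split-gain measure and delta the allowed error probability (the symbol names are ours):

```python
import math

def hoeffding_bound(R, delta, n):
    """With probability 1 - delta, the true mean of a random variable
    with range R, estimated from n observations, lies within epsilon
    of the observed mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# VFDT split rule (sketch): split on the best attribute once the observed
# gap between the best and second-best split gain exceeds epsilon, i.e.
#   best_gain - second_best_gain > hoeffding_bound(R, delta, n)
```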

  9. k-Nearest Neighbours (kNN)
  - maintain a dynamic buffer of instances; compare each test instance to these
  [scatter plot: buffered instances c1-c6 in the (x1, x2) plane, with a query point '?']

  10. k-Nearest Neighbours (kNN)
  - maintain a dynamic buffer of instances; compare each test instance to these (see the sketch below)
  [scatter plot: buffered instances c1-c6 in the (x1, x2) plane, with a query point '?']
  - automatically adapts to concept drift
  - limited buffer size; sensitive to the number of attributes
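A minimal sketch of windowed stream kNN as described above. The buffer size `window` and the majority-vote rule are standard choices for illustration, not necessarily the paper's exact configuration.

```python
import numpy as np
from collections import deque

class WindowKNN:
    """Stream kNN: keep only the most recent `window` labelled instances
    and classify by majority vote among the k nearest of them."""

    def __init__(self, k=3, window=1000):
        self.k = k
        # a bounded deque: the oldest instances fall out automatically,
        # which is what gives the passive adaptation to concept drift
        self.buffer = deque(maxlen=window)

    def learn_one(self, x, y):
        self.buffer.append((np.asarray(x, dtype=float), y))

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        # brute-force distance scan over the buffer (fine for a sketch)
        nearest = sorted(self.buffer,
                         key=lambda item: np.linalg.norm(item[0] - x))[: self.k]
        votes = [y for _, y in nearest]
        return max(set(votes), key=votes.count)
```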

  11. Preliminary Comparison
  Table: Results under prequential evaluation (see the sketch after this slide)

  Dataset      #Att.  #Ins.      HT    SGD   kNN   LB-HT
  Electricity      8     45,312  79.2  57.6  78.4  89.8
  CoverType       54    581,012  80.3  60.7  92.2  91.7
  SUSY             8  5,000,000  78.2  76.5  67.5  78.7

  - kNN performs relatively poorly on larger streams
  - the Hoeffding Tree works well, particularly in a Leveraging Bagging ensemble (LB-HT)
  - SGD performs poorly overall, less so on larger streams
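Prequential ('interleaved test-then-train') evaluation, as used for the table above, can be sketched as a single pass over the stream; the `predict`/`learn_one` interface is an assumption for illustration:

```python
def prequential_accuracy(stream, model):
    """Each instance is first used to test the current model, then to
    train it -- one pass, no separate hold-out set."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first ...
        model.learn_one(x, y)                  # ... then train
        total += 1
    return correct / total
```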

  12. A Modular View of Data Stream Classification
  [diagram: layered pipeline from inputs x1..x5 through transformed features z1..z4, an ensemble of models T producing outputs y(1), y(2), y(3), combined into the final prediction y]
  e.g., x → filter → ensemble of HT → filter → SGD → y (see the pipeline sketch below)
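A sketch of this modular composition: feature filters are chained in front of any incremental classifier. MOA expresses this with stream filters; the class below is our illustration, not MOA's API.

```python
class Pipeline:
    """Chain of filters (feature transforms) ending in an incremental
    classifier, mirroring x -> filter -> ... -> classifier -> y."""

    def __init__(self, filters, classifier):
        self.filters = filters        # each filter exposes transform(x)
        self.classifier = classifier  # exposes predict(x) / learn_one(x, y)

    def _apply(self, x):
        for f in self.filters:
            x = f.transform(x)
        return x

    def predict(self, x):
        return self.classifier.predict(self._apply(x))

    def learn_one(self, x, y):
        self.classifier.learn_one(self._apply(x), y)
```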

  13. Random Projections as a Filter
  - back-propagation is often too fiddly for data streams
  - can do a random projection instead (see the sketch below)
  [scatter plot: data points after a random projection]
  - not a new idea, but fits nicely into MOA as a filter; can be used with any classifier
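A sketch of such a filter: a fixed random matrix plus a nonlinearity maps the d input attributes to h random features, with no training of the weights. The Gaussian weight initialisation and sigmoid activation here are assumptions; the paper's exact choices may differ.

```python
import numpy as np

class RandomFeatureFilter:
    """Random feature functions: project the d-dimensional input through
    a fixed random matrix and a nonlinearity, yielding h features.
    Weights are drawn once and never updated (no back-propagation)."""

    def __init__(self, d, h, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(h, d))  # fixed random projection
        self.b = rng.normal(size=h)

    def transform(self, x):
        # sigmoid activation over the random projection
        return 1.0 / (1.0 + np.exp(-(self.W @ np.asarray(x) + self.b)))
```

Plugged in front of a classifier via the pipeline sketch above, e.g. `Pipeline([RandomFeatureFilter(d=8, h=64)], sgd)`, this corresponds to what the slides call the filtered variants such as SGD_f.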

  14. Results: Random Projections
  [figure: two panels (ELEC, SUSY) plotting accuracy against the ratio h/d of random features to input attributes, on a log scale, for HT, kNN, SGD and their filtered variants HT_f, kNN_f, SGD_f]
  - SGD responds best to a randomly projected input space
  - for kNN, it depends on the stream
  - no advantage for HT; however, one can put, for example, kNN or SGD_f ('filtered' SGD) in the leaves of HT

  15. Figure: SUSY, first 50,000 examples, in 100 windows
  [plot: windowed accuracy for HT, kNN, kNN_f, SGD, SGD_f, HT-SGD, HT-SGD_f, HT-kNN, LB-HT, LB-SGD_f]
  - SGD_f works well on larger data streams
  - ...also at the leaves of a Hoeffding Tree (HT-SGD_f)
  - the filter helps kNN (but is only worth it on smaller streams)

  16. Figure: SUSY, 5,000,000 examples, in 100 windows
  [plot: windowed accuracy for HT, kNN, kNN_f, SGD, SGD_f, HT-SGD, HT-SGD_f, HT-kNN, LB-HT, LB-SGD_f]
  - SGD_f works well on larger data streams
  - ...also at the leaves of a Hoeffding Tree (HT-SGD_f)
  - the filter helps kNN (but is only worth it on smaller streams)

  17. Other results (see paper for details)
  - HT-kNN (kNN in the leaves) often outperforms the benchmark HT (-NB) and is competitive with LB-HT (tied best overall), which has an ensemble of 10 HTs

  Table: Running time (s) on SUSY over 5,000,000 examples

  Method     Time (s)
  SGD              25
  SGD_f           118
  kNN            1464
  kNN_f          4714
  HT               45
  HT-SGD_f        159
  HT-kNN         1428
  LB-HT           530
  LB-SGD_f       1040

  18. Summary
  - a modular view of data stream classification allows a large number of method combinations to be trialled
  - new competitive combinations were found
  - random feature functions can work well in a number of different contexts
  - no extra parameter calibration is necessary for these methods
  Future work: prune and replace nodes over time

  19. Data Stream Classification using Random Feature Functions and Novel Method Combinations
  Jesse Read (1), Albert Bifet (2)
  (1) Department of Information and Computer Science, Aalto University and HIIT, Finland
  (2) Huawei Noah's Ark Lab, Hong Kong
  August 21, 2015
