Online machine learning with decision trees
Max Halford
University of Toulouse
Thursday 7th May, 2020
Decision trees
“Most successful general-purpose algorithm in modern times.” [HB12]
• Sub-divide a feature space into partitions
• Non-parametric and robust to noise
• Allow both numeric and categorical features
• Can be regularised in different ways
• Good weak learners for bagging and boosting [Bre96]
• See [BS16] for a modern review
• Many popular open-source implementations [PVG+11, CG16, KMF+17, PGV+18]
Alas, they assume that the data can be scanned more than once, and thus can't be used in an online context.
Toy example: the banana dataset¹
[Figure: the training set, the decision function with 1 tree, and the decision function with 10 trees; axes x₁ and x₂]
¹ Banana dataset on OpenML
Online (supervised) machine learning
Model learns from samples (x, y) ∈ ℝ^{n×p} × ℝ^{n×k} which arrive in sequence
Online ≠ out-of-core:
• Online: samples are only seen once
• Out-of-core: samples can be revisited
Progressive validation [BKL99]: ŷ can be obtained right before y is shown to the model, allowing the training set to also act as a validation set. No need for cross-validation!
Ideally, concept drift [GŽB+14] should be taken into account:
1. Virtual drift: P(X) changes
▶ Example: many 0s with sporadic bursts of 1s
2. Real drift: P(Y | X) changes
▶ Example: a feature's importance changes through time
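As a rough illustration of progressive validation, the loop below predicts on each incoming sample before learning from it. The `predict_one`/`learn_one` interface and the `stream` and `metric` objects are assumptions made for this sketch, not something prescribed by the slides.

```python
def progressive_validation(model, stream, metric):
    """Score a model online: each sample is used for validation before training."""
    for x, y in stream:                # samples arrive one at a time
        y_pred = model.predict_one(x)  # ŷ is obtained before y is shown to the model
        metric.update(y, y_pred)       # the training stream doubles as a validation set
        model.learn_one(x, y)          # only then does the model learn from (x, y)
    return metric
```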
Online decision trees
• Building a decision tree involves enumerating split candidates
• Each split is evaluated by scanning the data
• This can't be done online without storing the data
Two approaches to circumvent this:
1. Store and update feature distributions
2. Build the trees without looking at the data (!!)
Bagging and boosting can be done online [OR01]
Consistency
• Trees fall under the non-parametric regression framework
• Goal: estimate a regression function f(x) = 𝔼(Y | X = x)
• We estimate f with an approximation f_n trained with n samples
• f_n is consistent if 𝔼[(f_n(X) − f(X))²] → 0 as n → +∞
• Ideally, we also want our estimator to be unbiased
• We also want regularisation mechanisms in order to generalise
• Somewhat orthogonal to concept drift handling
Hoeffding trees
• Split thresholds t are chosen by minimising an impurity criterion
• The impurity looks at the distribution of Y in each child
• An impurity criterion therefore depends on P(Y | X < t)
• P(Y | X < t) can be obtained via Bayes' rule:
P(Y | X < t) = P(X < t | Y) × P(Y) / P(X < t)
For classification, assuming X is numeric:
• P(Y) is a counter
• P(X < t) can be represented with a histogram
• P(X < t | Y) can be represented with one histogram per class
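A minimal sketch of how these three quantities can be maintained online, assuming a single numeric feature scaled to [0, 1] and fixed-width histogram bins (a simplification; real implementations typically use adaptive histograms or per-class Gaussian estimators):

```python
from collections import Counter, defaultdict

import numpy as np

N_BINS = 20  # fixed bins over [0, 1]; a simplification for the sketch

class_counts = Counter()                       # estimates P(Y)
hists = defaultdict(lambda: np.zeros(N_BINS))  # one histogram per class, estimates P(X < t | Y)

def update(x, y):
    """Update the statistics with one sample; x is a single numeric feature in [0, 1]."""
    class_counts[y] += 1
    hists[y][min(int(x * N_BINS), N_BINS - 1)] += 1

def p_y_given_x_lt_t(t):
    """Estimate P(Y | X < t) for every class via Bayes' rule."""
    n = sum(class_counts.values())
    if n == 0:
        return {}
    b = min(int(t * N_BINS), N_BINS - 1)  # count the bins strictly below t
    p_x_lt_t = sum(h[:b].sum() for h in hists.values()) / n
    if p_x_lt_t == 0:
        return {}
    return {
        y: (hists[y][:b].sum() / class_counts[y])  # P(X < t | Y = y)
           * (class_counts[y] / n)                 # P(Y = y)
           / p_x_lt_t                              # P(X < t)
        for y in class_counts
    }
```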
Hoeffding tree construction algorithm
• A Hoeffding tree starts off as a leaf
• P(Y), P(X < t), and P(X < t | Y) are updated every time a sample arrives
• Every so often, we enumerate some candidate splits and evaluate them
• The best split is chosen if it is significantly better than the second best split
• Significance is determined by the Hoeffding bound
• Once a split is chosen, the leaf becomes a branch and the same steps occur within each child
• Introduced in [DH00]
• Many variants, including revisiting split decisions when drift occurs [HSD01]
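A sketch of the split decision, assuming `best_gain` and `second_best_gain` are the empirical merits of the two top candidate splits and `value_range` is the range of the impurity metric; the tie-breaking threshold used by real implementations is omitted here.

```python
import math

def should_split(best_gain, second_best_gain, value_range, n, delta=1e-5):
    """Decide whether the best candidate split is significantly better than the runner-up.

    The Hoeffding bound guarantees that, with probability 1 - delta, the true mean of a
    random variable with range `value_range` lies within eps of its mean over n observations.
    """
    eps = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))
    return best_gain - second_best_gain > eps
```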
Hoeffding trees on the banana dataset
[Figure: decision functions of a single tree and of 10 trees; axes x₁ and x₂]
Mondrian trees
• Construction follows a Mondrian process [RT+08]
• Split features and split points are chosen without considering their predictive power
• Hierarchical averaging is used to smooth leaf values
• First introduced in [LRT14]
• Improved in [MGS19]
Figure: Composition A by Piet Mondrian
The Mondrian process
• Let ℓ_j and u_j be the lower and upper bounds of feature j in a cell
• Sample E ∼ Exp(∑_{j=1}^{p} (u_j − ℓ_j))
• Split if E < λ
• The chance of splitting shrinks as the cells get smaller
• λ acts as a soft maximum depth parameter
• The split feature is chosen with probability proportional to u_j − ℓ_j
• More information in these slides
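A rough sketch of one step of the process for a single cell, under the simplification shown above where the waiting time E is compared directly to the lifetime λ (in the full process the exponential waiting times accumulate along the path from the root before being compared to λ):

```python
import random

def try_split(lower, upper, lam):
    """One step of the (simplified) Mondrian process for a cell with bounds lower/upper.

    Returns (feature, threshold) if the cell splits, None otherwise.
    """
    extents = [u - l for l, u in zip(lower, upper)]
    rate = sum(extents)
    if rate == 0:
        return None
    E = random.expovariate(rate)  # small cells => small rate => large E => rarely split
    if E >= lam:
        return None
    j = random.choices(range(len(extents)), weights=extents)[0]  # feature ∝ its extent
    t = random.uniform(lower[j], upper[j])                       # split point uniform in the cell
    return j, t
```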
Mondrian trees on the banana dataset
[Figure: decision functions of a single tree and of 10 trees; axes x₁ and x₂]
Aggregated Mondrian trees on the banana dataset
[Figure: decision functions of a single tree and of 10 trees; axes x₁ and x₂]
Purely random trees
• Features x are assumed to lie in [0, 1]^p
• Trees are constructed independently from the data, before it even arrives:
1. Pick a feature at random
2. Pick a split point at random
3. Repeat until the desired depth is reached
• When a sample reaches a leaf, that leaf's running average is updated
• Easier to analyse because the tree structure doesn't depend on Y
• Consistency depends on:
1. The height of the tree, denoted h
2. The number of features that are “relevant”
• Bias analysis performed in [AG14]
• Word of caution: this is different from extremely randomised trees [GEW06]
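A minimal sketch of a purely random regression tree under these assumptions; the `learn_one`/`predict_one` names are illustrative, not part of the slides.

```python
import random

class Node:
    """One node of a purely random tree over [0, 1]^p, built before seeing any data."""

    def __init__(self, lower, upper, depth, max_depth):
        self.is_leaf = depth == max_depth
        if self.is_leaf:
            self.n, self.mean = 0, 0.0                  # running average of the targets
            return
        p = len(lower)
        j = random.randrange(p)                         # 1. pick a feature at random
        t = random.uniform(lower[j], upper[j])          # 2. pick a split point at random
        self.feature, self.threshold = j, t
        left_upper, right_lower = list(upper), list(lower)
        left_upper[j], right_lower[j] = t, t
        self.left = Node(lower, left_upper, depth + 1, max_depth)    # 3. recurse until the
        self.right = Node(right_lower, upper, depth + 1, max_depth)  #    desired depth is reached

    def _leaf(self, x):
        node = self
        while not node.is_leaf:
            node = node.left if x[node.feature] < node.threshold else node.right
        return node

    def learn_one(self, x, y):
        leaf = self._leaf(x)
        leaf.n += 1
        leaf.mean += (y - leaf.mean) / leaf.n           # update the leaf's running average

    def predict_one(self, x):
        return self._leaf(x).mean
```

For example, `tree = Node(lower=[0.0, 0.0], upper=[1.0, 1.0], depth=0, max_depth=5)` draws the whole structure before any data arrives; `learn_one` then only touches leaf statistics.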
Uniform random trees
• Features and split points are chosen completely at random
• Let h be the height of the tree
• Consistent when h → +∞ and h/n → 0 as n → +∞ [BDL08]
Uniform random trees
[Figure: example partitions produced by uniform random trees; axes x₁ and x₂]
Uniform random trees on the banana dataset
[Figure: decision functions of a single tree and of 10 trees; axes x₁ and x₂]
Centered random trees
• Features are chosen completely at random
• Split points are the mid-points of a feature's current range
• Consistent when h → +∞ and 2^h/n → 0 as n → +∞ [Sco16]
Centered random trees
[Figure: example partitions produced by centered random trees; axes x₁ and x₂]
Centered random trees on the banana dataset
[Figure: decision functions of a single tree and of 10 trees; axes x₁ and x₂]
How about a compromise?
• Choose δ ∈ [0, 1/2]
• Sample t in [a + δ(b − a), b − δ(b − a)]
• δ = 0 ⟹ t ∈ [a, b] (uniform)
• δ = 1/2 ⟹ t = (a + b)/2 (centered)
[Figure: example partition with δ = 0.2; axes x₁ and x₂]
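A one-line sketch of this interpolation between the two strategies, for a feature whose current range is [a, b]:

```python
import random

def sample_split_point(a, b, delta):
    """Sample a split threshold for a feature whose current range is [a, b].

    delta = 0   recovers the uniform strategy (t can fall anywhere in [a, b]);
    delta = 0.5 recovers the centered strategy (t is exactly the mid-point).
    """
    assert 0 <= delta <= 0.5
    width = b - a
    return random.uniform(a + delta * width, b - delta * width)
```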
Some examples
[Figure: example partitions for δ = 0.1, δ = 0.25, and δ = 0.4; axes x₁ and x₂]
Banana dataset with δ = 0.2
[Figure: decision functions of a single tree and of 10 trees; axes x₁ and x₂]
Impact of δ on performance
[Figure: log loss as a function of δ for tree heights 1, 3, 5, 7, and 9]
Tree regularisation
A decision tree overfits when its leaves contain too few samples. There are many popular ways to regularise trees:
1. Set a lower limit on the number of samples in each leaf
2. Limit the maximum depth
3. Discard irrelevant nodes after training (pruning)
None of these are designed to take the streaming aspect of online decision trees into account
Hierarchical smoothing
• Intuition: a leaf doesn't contain enough samples... but its ancestors might!
• Let G(x_t) be the nodes that go from the root to the leaf for a sample x_t
• Curtailment [ZE01]: use the deepest node in G(x_t) with at least k samples
• Aggregated Mondrian trees [MGS19] use context tree weighting
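A sketch of curtailment at prediction time, assuming each node exposes an `n` attribute (number of samples seen) and a `value` prediction, and that the path is ordered from the root down to the leaf:

```python
def predict_with_curtailment(path, k):
    """Predict with the deepest node on the root-to-leaf path that holds at least k samples."""
    for node in reversed(path):  # walk back up from the leaf towards the root
        if node.n >= k:
            return node.value
    return path[0].value         # fall back to the root if even it has fewer than k samples
```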