CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
Input features: N features X_1, X_2, …, X_N. Each X_j has a domain D_j: categorical (e.g., D_j = {red, blue}) or numerical (e.g., D_j = (0, 10)). Y is the output variable with domain D_Y: categorical → classification; numerical → regression. Task: given an input data vector x_i, predict y_i. [Figure: example decision tree with internal nodes A–E, splits such as X_1 < v_1 and X_2 ∈ {v_2, v_3}, leaves F–I, and a leaf predicting Y = 0.42]
Decision trees: split the data at each internal node; each leaf node makes a prediction. Lecture today: binary splits X_j < v, numerical attributes, regression.
Input: example x_i. Output: predicted y_i'. "Drop" x_i down the tree until it hits a leaf node, then predict the value stored in the leaf that x_i hits.
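For concreteness, a minimal Python sketch of this traversal, assuming a hypothetical Node class (the field names attr, threshold, and value are ours, not from the lecture):

```python
# A minimal sketch of "dropping" x_i down a tree; the Node layout is illustrative.

class Node:
    def __init__(self, attr=None, threshold=None, left=None, right=None, value=None):
        self.attr = attr            # index j of split attribute X_j (internal nodes)
        self.threshold = threshold  # split value v; the test is x[j] < v
        self.left = left            # subtree for x[j] < v
        self.right = right          # subtree for x[j] >= v
        self.value = value          # prediction stored at a leaf

def predict(root, x):
    """Walk from the root to a leaf and return the leaf's stored value."""
    node = root
    while node.value is None:       # still at an internal node
        node = node.left if x[node.attr] < node.threshold else node.right
    return node.value

# e.g., a tiny tree: root splits on X_0 < 5.0, leaves predict 0.42 / 1.7
tree = Node(attr=0, threshold=5.0, left=Node(value=0.42), right=Node(value=1.7))
assert predict(tree, [3.0]) == 0.42
```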
Training dataset D*, |D*| = 100 examples. [Figure: the example tree annotated with the number of examples traversing each edge: |D| = 90 vs. |D| = 10 below the root A (split X_1 < v_1), |D| = 45 and |D| = 45 below B (split X_2 < v_2), and |D| = 25, 30, 20, 15 reaching the leaves F–I]
Imagine we are currently at some node G. Let D_G be the data that reaches G. There is a decision we have to make: Do we continue building the tree? If so, which variable and which value do we use for a split? If not, how do we make a prediction? We need to build a "predictor node".
Alternative view: [Figure: data points labeled + and − scattered in the (X_1, X_2) plane; each tree split corresponds to an axis-aligned partition of this plane]
Requires at least a single pass over the data!
How to split? Pick the attribute & value that optimizes some criterion. Classification: information gain IG(Y|X) = H(Y) − H(Y|X). Entropy: H(Y) = −∑_{j=1}^{m} p_j log p_j. Conditional entropy: H(W|Z) = ∑_{j=1}^{m} P(Z = v_j) H(W | Z = v_j). Suppose Z takes m values (v_1 … v_m); H(W|Z = v) is the entropy of W among the records in which Z has value v.
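To make the formulas concrete, a small Python sketch of entropy and information gain (the function names are ours; log base 2 is one common convention):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_j p_j log2 p_j over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """IG(Y|X) = H(Y) - sum_v P(X = v) * H(Y | X = v)."""
    n = len(labels)
    h_cond = 0.0
    for v in set(attr_values):
        subset = [y for y, x in zip(labels, attr_values) if x == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_cond

# e.g., X perfectly predicts Y, so IG equals H(Y) = 1 bit
print(information_gain(["+", "+", "-", "-"], ["red", "red", "blue", "blue"]))  # 1.0
```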
How to split? Pick the attribute & value that optimizes some criterion. Regression: find the split (X_i, v) that creates D, D_L, D_R (parent, left, and right child datasets) and maximizes |D| · Var(D) − (|D_L| · Var(D_L) + |D_R| · Var(D_R)). For ordered domains, sort X_i and consider a split between each pair of adjacent values. For categorical X_i, find the best split based on subsets (Breiman's algorithm).
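A naive sketch of this search for a single numerical attribute (ours, not PLANET's implementation; a production version would keep running sums to score all splits in one sorted pass):

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_numeric_split(xs, ys):
    """Return (v, gain) maximizing |D|Var(D) - (|D_L|Var(D_L) + |D_R|Var(D_R))
    over splits x < v placed between adjacent sorted values of the attribute."""
    pairs = sorted(zip(xs, ys))
    parent = len(pairs) * variance([y for _, y in pairs])
    best_v, best_gain = None, float("-inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue                      # no valid split between equal values
        v = (pairs[k - 1][0] + pairs[k][0]) / 2
        yl = [y for _, y in pairs[:k]]
        yr = [y for _, y in pairs[k:]]
        gain = parent - (len(yl) * variance(yl) + len(yr) * variance(yr))
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```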
When to stop? 1) When the leaf is "pure", e.g., Var(y_i) < ε. 2) When the number of examples in the leaf is too small, e.g., |D| ≤ 10. How to predict? Regression: average y_i of the examples in the leaf. Classification: most common y_i in the leaf.
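A short sketch of the resulting predictor node and stopping test, using the thresholds from the slide (ε and the function names are our choices):

```python
from collections import Counter

def leaf_prediction(ys, regression=True):
    """Mean y_i for regression, most common y_i for classification."""
    if regression:
        return sum(ys) / len(ys)
    return Counter(ys).most_common(1)[0][0]

def should_stop(ys, eps=1e-3, min_examples=10):
    """Stop when the node is (nearly) pure or holds too few examples."""
    if len(ys) <= min_examples:
        return True
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys) < eps
```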
Given a large dataset with hundreds of attributes, build a decision tree! General considerations: the tree is small (can be kept in memory): shallow (~10 levels). The dataset is too large to keep in memory, and too big to scan over on a single machine. MapReduce to the rescue! [Figure: a tree in which each node is produced by a FindBestSplit step]
PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB '09]. A sequence of MapReduce jobs that build a decision tree. Setting: hundreds of numerical (discrete & continuous) attributes; the target (class) is numerical: regression; splits are binary: X_j < v; the decision tree is small enough for each mapper to keep in memory; the data is too large to keep in memory.
[Figure: PLANET architecture. The Master coordinates the process; it maintains the model and attribute metadata, reads intermediate results, and launches FindBestSplit and InMemoryGrow MapReduce jobs over the input data]
The mapper loads the model and info about which attribute splits to consider. Each mapper sees a subset of the data D*. The mapper "drops" each datapoint down the tree to find the appropriate leaf node L. For each leaf node L it keeps statistics about 1) the data reaching L and 2) the data in the left/right subtree under each candidate split S. The reducer aggregates the statistics (1) and (2) and determines the best split for each node.
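Schematically, in Python-style MapReduce pseudocode (the key/value layout is our simplification of PLANET for the regression case, where (count, Σy, Σy²) are sufficient statistics for the variance; the helpers drop_to_leaf and leaf_id are hypothetical):

```python
def mapper(record, tree, candidate_splits):
    """Emit sufficient statistics (count, sum y, sum y^2) per leaf and
    per (split, side), for every training record this mapper sees."""
    x, y = record
    leaf = drop_to_leaf(tree, x)                            # hypothetical model walk
    yield (leaf_id(leaf), "node"), (1, y, y * y)            # stats (1)
    for (j, v) in candidate_splits[leaf_id(leaf)]:
        side = "L" if x[j] < v else "R"
        yield (leaf_id(leaf), (j, v, side)), (1, y, y * y)  # stats (2)

def reducer(key, values):
    """Aggregate partial statistics; the master then scores each split,
    using n * Var = sum_yy - sum_y**2 / n."""
    n = s = ss = 0.0
    for (cnt, sy, syy) in values:
        n, s, ss = n + cnt, s + sy, ss + syy
    yield key, (n, s, ss)
```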
Master: monitors everything (runs multiple MapReduce jobs). MapReduce Initialization: for each attribute, identify the values to be considered for splits. MapReduce FindBestSplit: MapReduce job to find the best split when there is too much data to fit in memory (the hardest part). MapReduce InMemoryBuild: similar to FindBestSplit (but for small data); grows an entire sub-tree once the data fits in memory. Model file: a file describing the state of the model.
Initialization: identifies all the attribute values which need to be considered for splits. Splits for numerical attributes have the form X_j < v. We would like to consider every possible value v ∈ D_j; instead, compute an approximate equi-depth histogram on D*. Idea: select buckets such that the counts per bucket are equal, and use the boundary points of the histogram as potential splits. Generates "attribute metadata" to be loaded in memory by other tasks. [Figure: equi-depth histogram of counts per bucket over domain values 1–20]
Goal: an equal number of elements per bucket (B buckets total). Construct by first sorting and then taking B − 1 equally spaced splits: 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 16 18 19 20 20 20. Faster construction: sample & take equally spaced splits in the sample; gives nearly equal buckets. [Figure: histogram of counts per bucket over domain values 1–20]
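A sketch of the sampling construction (the function and parameter names are ours):

```python
import random

def equi_depth_boundaries(values, num_buckets, sample_size=10000):
    """Approximate equi-depth histogram: sort a sample and take
    B - 1 equally spaced elements as bucket boundaries."""
    sample = random.sample(values, min(sample_size, len(values)))
    sample.sort()
    step = len(sample) / num_buckets
    return [sample[int(k * step)] for k in range(1, num_buckets)]

# e.g., 4 buckets over the sorted list from the slide -> 3 boundary points,
# which become the candidate split values v for X_j < v
vals = [1,2,2,3,4,7,8,9,10,10,10,10,11,11,12,12,14,16,16,18,19,20,20,20]
print(equi_depth_boundaries(vals, 4, sample_size=len(vals)))  # [8, 11, 16]
```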
A B C D E F G H I Controls the entire process Determines the state of the tree and grows it: Decides if nodes should be split If there is little data entering a node, runs an InMemory-Build MapReduce job to grow the entire subtree For larger nodes, launches MapReduce FindBestSplit to find candidates for best split Collects results from MapReduce jobs and chooses the best split for a node Updates model 2/28/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 21
The Master keeps two node queues: MapReduceQueue (MRQ): nodes for which D is too large to fit in memory. InMemoryQueue (InMemQ): nodes for which the data D in the node fits in memory. The tree is built in levels, epoch by epoch.
Two MapReduce jobs: FindBestSplit: processes nodes from the MRQ; for a given set of nodes S, computes a candidate good split predicate for each node in S. InMemoryBuild: processes nodes from the InMemQ; for a given set of nodes S, completes tree induction at the nodes in S using the InMemoryBuild algorithm. Start by executing FindBestSplit on the full data D*.
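Putting the scheduling together, a high-level sketch of the master's loop (run_find_best_split, run_in_memory_build, apply_split, and data_size are stand-ins for PLANET's actual machinery):

```python
from collections import deque

def master_loop(root, memory_limit):
    """Grow the tree epoch by epoch from the two queues."""
    mrq, inmemq = deque([root]), deque()      # start with FindBestSplit on D*
    while mrq or inmemq:
        if mrq:
            batch = list(mrq); mrq.clear()
            for node, split in run_find_best_split(batch):   # one MR job
                left, right = apply_split(node, split)       # update the model
                for child in (left, right):
                    if data_size(child) <= memory_limit:
                        inmemq.append(child)
                    else:
                        mrq.append(child)
        if inmemq:
            batch = list(inmemq); inmemq.clear()
            run_in_memory_build(batch)        # one MR job completes the subtrees
```

This sketch omits the case where a node satisfies the stopping condition and becomes a leaf instead of being queued.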