

  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  2.  Input features: N features X1, X2, …, XN
       Each Xj has a domain Dj
        Categorical: e.g., Dj = {red, blue}
        Numerical: e.g., Dj = (0, 10)
       Y is the output variable with domain DY:
        Categorical: Classification
        Numerical: Regression
       Task: Given an input data vector xi, predict yi

  3.  Decision trees:
       Split the data at each internal node
       Each leaf node makes a prediction
      Lecture today:
       Binary splits: Xj < v
       Numerical attributes
       Regression
      [Figure: example decision tree. Root A tests X1 < v1; its right child C is a leaf predicting Y = 0.42; its left child B tests X2 < v2, whose children D and E test X3 < v4 and X2 < v5, leading to leaves F, G, H, I.]

  4.  Input: example xi
      Output: predicted yi'
       “Drop” xi down the tree until it hits a leaf node
       Predict the value stored in the leaf that xi hits
      [Figure: the same tree; xi follows the path from root A through the internal tests down to one of the leaves.]
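As an editorial illustration (not from the slides), here is a minimal Python sketch of this prediction procedure, assuming a binary tree of numerical tests Xj < v as in the lecture's setting; the Node class and its field names are hypothetical.

class Node:
    def __init__(self, attr=None, threshold=None, left=None, right=None,
                 prediction=None):
        self.attr = attr              # index j of the split attribute X_j
        self.threshold = threshold    # split value v in the test X_j < v
        self.left = left              # subtree for X_j < v
        self.right = right            # subtree for X_j >= v
        self.prediction = prediction  # value stored at a leaf

    def is_leaf(self):
        return self.prediction is not None

def predict(root, x):
    """Drop x down the tree until it hits a leaf; return the stored value."""
    node = root
    while not node.is_leaf():
        node = node.left if x[node.attr] < node.threshold else node.right
    return node.prediction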

  5.  Training dataset D*, |D*| = 100 examples
       Each edge is labeled with the number of examples traversing it
      [Figure: the same tree with edge counts. Root A sends |D|=90 to B and |D|=10 to the leaf C (Y = 0.42); B sends |D|=45 to each of D and E; the four leaf edges carry |D|=25, |D|=30, |D|=20, and |D|=15.]

  6.  Imagine we are currently at some node G
       Let DG be the data that reaches G
      There is a decision we have to make: Do we continue building the tree?
       If so, which variable and which value do we use for a split?
       If not, how do we make a prediction?
       We need to build a “predictor node”

  7.  Alternative view:
      [Figure: the same data plotted in the (X1, X2) plane, with examples labeled + and –; splits correspond to axis-parallel cuts.]

  8.  Requires at least a single pass over the data!

  9.  How to split? Pick the attribute & value that optimizes some criterion
       Classification: Information Gain
        IG(Y|X) = H(Y) – H(Y|X)
       Entropy: H(Y) = – Σj pj log pj
       Conditional entropy: H(Y|X) = Σj P(X = vj) · H(Y | X = vj)
       Suppose X takes m values v1, …, vm
       H(Y|X = v) is the entropy of Y among the records in which X has value v
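A small Python sketch of these quantities, assuming class labels and attribute values arrive as plain lists; this is illustrative, not the lecture's code.

import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_j p_j * log2(p_j) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """IG(Y|X) = H(Y) - H(Y|X), where H(Y|X) weights the entropy of Y
    within each group of records sharing the same value of X."""
    n = len(labels)
    groups = {}
    for y, v in zip(labels, attr_values):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond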

  10.  How to split? Pick the attribute & value that optimizes some criterion
        Regression: Find the split (Xi, v) that creates D, DL, DR (parent, left, and right child datasets) and maximizes:
         |D| · Var(D) – ( |DL| · Var(DL) + |DR| · Var(DR) )
        For ordered domains, sort Xi and consider a split between each pair of adjacent values
        For categorical Xi, find the best split based on subsets (Breiman’s algorithm)
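A sketch of this criterion for a single numerical attribute, assuming the (x, y) pairs at the node fit in memory; the function names are illustrative.

def variance(ys):
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / n

def best_split(xs, ys):
    """Sort on the attribute and score a split between each pair of
    adjacent distinct values, maximizing
    |D|*Var(D) - (|D_L|*Var(D_L) + |D_R|*Var(D_R))."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = n * variance([y for _, y in pairs])
    best_v, best_score = None, float("-inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold separates equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = total - (len(left) * variance(left) + len(right) * variance(right))
        if score > best_score:
            # place the threshold halfway between the adjacent values
            best_v = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_v, best_score

In practice one sweeps the sorted values while maintaining running sums of y and y², so each candidate split is scored in O(1); the quadratic version above just keeps the criterion explicit.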

  11.  When to stop?
        1) When the leaf is “pure”, e.g., Var(yi) < ε
        2) When the number of examples in the leaf is too small, e.g., |D| ≤ 10
       How to predict?
        Regression: the average yi of the examples in the leaf
        Classification: the most common yi in the leaf
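The stopping test and leaf predictors translate directly; a small sketch, where the ε value is an assumption (the slide leaves it unspecified).

from collections import Counter
from statistics import pvariance

EPSILON = 1e-3      # purity threshold for Var(y_i); assumed value
MIN_LEAF_SIZE = 10  # stop when |D| <= 10, as on the slide

def should_stop(ys):
    return len(ys) <= MIN_LEAF_SIZE or pvariance(ys) < EPSILON

def leaf_prediction(ys, task="regression"):
    if task == "regression":
        return sum(ys) / len(ys)              # average y_i in the leaf
    return Counter(ys).most_common(1)[0][0]   # most common y_i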

  12.

  13.  Given a large dataset with hundreds of attributes, build a decision tree!
       General considerations:
        The tree is small (can be kept in memory): shallow (~10 levels)
        The dataset is too large to keep in memory
        The dataset is too big to scan on a single machine
        MapReduce to the rescue!
       [Figure: a growing tree with FindBestSplit invoked at each frontier node.]

  14.

  15.  PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB ’09]
        A sequence of MapReduce jobs that builds a decision tree
       Setting:
        Hundreds of numerical (discrete & continuous) attributes
        Target (class) is numerical: Regression
        Splits are binary: Xj < v
        The decision tree is small enough for each Mapper to keep in memory
        The data is too large to keep in memory

  16.  [Figure: PLANET architecture. The Master holds the model (tree nodes A–I) and attribute metadata; it launches FindBestSplit and InMemoryGrow MapReduce jobs, which scan the input data and return intermediate results to the Master.]

  17.  The Mapper loads the model and info about which attribute splits to consider
        Each Mapper sees a subset of the data D*
        The Mapper “drops” each datapoint down the tree to find the appropriate leaf node L
        For each leaf node L it keeps statistics about:
        1) the data reaching L
        2) the data in the left/right subtree under each candidate split S
        The Reducer aggregates the statistics (1) and (2) and determines the best split for each node
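A sketch of this map/reduce logic for the regression setting, assuming the per-key statistics are the sufficient statistics for variance, (count, Σy, Σy²); find_leaf() and the split bookkeeping are hypothetical stand-ins for PLANET's internals.

def emit(key, value):
    # stand-in for the MapReduce framework's output collector
    print(key, value)

def map_example(model, candidate_splits, x, y):
    """Mapper: drop (x, y) to its leaf, then emit stats for the leaf
    itself and for each candidate split's left/right branch."""
    leaf = find_leaf(model, x)                 # assumed helper: walk the tree
    emit((leaf, "node"), (1, y, y * y))
    for (attr, v) in candidate_splits[leaf]:
        branch = "L" if x[attr] < v else "R"
        emit((leaf, attr, v, branch), (1, y, y * y))

def reduce_stats(key, values):
    """Reducer: sum the statistics component-wise. The Master can then
    score |D|*Var(D) - |D_L|*Var(D_L) - |D_R|*Var(D_R) from them,
    since n*Var = sum(y^2) - (sum(y))^2 / n."""
    n = sum(v[0] for v in values)
    s = sum(v[1] for v in values)
    s2 = sum(v[2] for v in values)
    emit(key, (n, s, s2))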

  18.  Master
        Monitors everything (runs multiple MapReduce jobs)
       MapReduce Initialization
        For each attribute, identify the values to be considered for splits
       MapReduce FindBestSplit
        MapReduce job to find the best split when there is too much data to fit in memory (the hardest part)
       MapReduce InMemoryBuild
        Similar to FindBestSplit (but for small data)
        Grows an entire subtree once the data fits in memory
       Model file
        A file describing the state of the model

  19.  Identifies all the attribute values which need to be considered for splits
        Splits for numerical attributes: Xj < v
        We would like to consider every possible value v ∈ D
        Instead, compute an approximate equi-depth histogram on D*
        Idea: select buckets such that the counts per bucket are equal
       [Figure: equi-depth histogram of bucket counts over domain values 1–20.]
        Use the boundary points of the histogram as potential splits
        Generates “attribute metadata” to be loaded in memory by other tasks

  20.  Goal: an equal number of elements per bucket (B buckets total)
        Construct by first sorting and then taking B – 1 equally spaced splits
        Example (sorted data): 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 16 18 19 20 20 20
        Faster construction: sample & take equally spaced splits in the sample
        Gives nearly equal buckets
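A sketch of the sampling-based construction; the sample size is an assumption (the slides do not specify one).

import random

def equidepth_boundaries(values, num_buckets, sample_size=10000):
    """Sample, sort, and take B-1 equally spaced elements as bucket
    boundaries; these become the candidate split points."""
    sample = random.sample(values, min(sample_size, len(values)))
    sample.sort()
    step = len(sample) / num_buckets
    return [sample[int(i * step)] for i in range(1, num_buckets)]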

  21.  Controls the entire process
        Determines the state of the tree and grows it:
        Decides if nodes should be split
        If little data enters a node, runs an InMemoryBuild MapReduce job to grow the entire subtree
        For larger nodes, launches a FindBestSplit MapReduce job to find candidates for the best split
        Collects results from the MapReduce jobs and chooses the best split for a node
        Updates the model

  22.  The Master keeps two node queues:
        MapReduceQueue (MRQ): nodes for which D is too large to fit in memory
        InMemoryQueue (InMemQ): nodes for which the data D in the node fits in memory
        The tree is built in levels, epoch by epoch
       [Figure: a split Xj < v partitioning a node’s data D into DL and DR; the tree A–I grows one level per epoch.]
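A sketch of the Master's epoch loop over the two queues; run_find_best_split(), run_in_memory_build(), and apply_split() are assumed helpers standing in for the MapReduce jobs and the model update, and the node objects are hypothetical.

from collections import deque

def master_loop(root, memory_limit):
    mrq, in_mem_q = deque([root]), deque()
    while mrq or in_mem_q:
        # One epoch: batch all large nodes into a single FindBestSplit
        # MapReduce job, and all small nodes into InMemoryBuild.
        if mrq:
            batch = list(mrq)
            mrq.clear()
            for node, split in run_find_best_split(batch):  # MR job (assumed)
                left, right = apply_split(node, split)      # update model (assumed)
                for child in (left, right):
                    q = in_mem_q if child.size <= memory_limit else mrq
                    q.append(child)
        if in_mem_q:
            batch = list(in_mem_q)
            in_mem_q.clear()
            run_in_memory_build(batch)  # MR job growing whole subtrees (assumed)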

  23.  Two MapReduce jobs:
        FindBestSplit: processes nodes from the MRQ. For a given set of nodes S, computes a candidate good split predicate for each node in S
        InMemoryBuild: processes nodes from the InMemQ. For a given set of nodes S, completes tree induction at the nodes in S using the InMemoryBuild algorithm
        Start by executing FindBestSplit on the full data D*
