Faloutsos & Pavlo, CMU SCS 15-415/615

Carnegie Mellon Univ.
Dept. of Computer Science
15-415/615 – DB Applications

Data Warehousing / Data Mining
(R&G, ch 25 and 26)
C. Faloutsos and A. Pavlo

Data mining - detailed outline
• Problem
• Getting the data: Data Warehouses, DataCubes, OLAP
• Supervised learning: decision trees
• Unsupervised learning
  – association rules

Problem
Given: multiple data sources
Find: patterns (classifiers, rules, clusters, outliers...)

  sales(p-id, c-id, date, $price)
  customers(c-id, age, income, ...)
  ???

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
• How?
• How often?
• How about discrepancies / non-homogeneities?
[Figure: data sources at PGH, NY, and SF]

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
• How? A: Triggers / Materialized views
• How often? A: [Art!]
• How about discrepancies / non-homogeneities? A: Wrappers / Mediators

Data Warehousing
Step 2: collect counts (DataCubes / OLAP). E.g.:

OLAP
Problem: "Is it true that shirts in large sizes sell better in dark colors?"

  sales:
  c-id  p-id   Size  Color  $
  C10   Shirt  L     Blue   30
  C10   Pants  XL    Red    50
  C20   Shirt  XL    White  20
  ...

DataCubes
'color', 'size': DIMENSIONS
'count': MEASURE

  C / S    S    M    L  TOT
  Red     20    3    5   28
  Blue     3    3    8   14
  Gray     0    0    5    5
  TOT     23    6   18   47

Group-bys: φ; size; color; (color; size)

DataCubes
'color', 'size': DIMENSIONS
'count': MEASURE

  C / S    S    M    L  TOT
  Red     20    3    5   28
  Blue     3    3    8   14
  Gray     0    0    5    5
  TOT     23    6   18   47

DataCube: the full table, over the group-bys φ; size; color; (color; size)

DataCubes
SQL query to generate DataCube:
• Naively (and painfully):

    select size, color, count(*)
    from sales where p-id = 'shirt'
    group by size, color

    select size, count(*)
    from sales where p-id = 'shirt'
    group by size
    ...

DataCubes
SQL query to generate DataCube:
• with 'cube by' keyword:

    select size, color, count(*)
    from sales where p-id = 'shirt'
    cube by size, color

DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)
Q2: Which operations to allow
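
The 'cube by' query above asks for one count per group-by in the lattice (φ; size; color; color, size). A minimal Python sketch of that computation follows. The counts are copied from the slides' cross-tab; the names `COUNTS`, `rows`, and `data_cube` are illustrative assumptions, not part of the course material. In standard SQL (SQL:1999 and later) the same request is usually spelled `group by cube (size, color)` rather than `cube by`.

```python
from collections import Counter
from itertools import combinations

# Shirt counts per (color, size), copied from the cross-tab in the slides.
COUNTS = {
    ("Red", "S"): 20, ("Red", "M"): 3, ("Red", "L"): 5,
    ("Blue", "S"): 3, ("Blue", "M"): 3, ("Blue", "L"): 8,
    ("Gray", "S"): 0, ("Gray", "M"): 0, ("Gray", "L"): 5,
}

# Expand the counts into one row per sale, as a base table would hold them.
rows = [{"color": c, "size": s} for (c, s), n in COUNTS.items() for _ in range(n)]


def data_cube(rows, dims):
    """Count rows for every subset of dims; 'all' marks an aggregated-away dimension."""
    cube = Counter()
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):  # one group-by per lattice node
            for row in rows:
                key = tuple(row[d] if d in group_by else "all" for d in dims)
                cube[key] += 1
    return cube


cube = data_cube(rows, ("color", "size"))
print(cube[("Red", "S")])     # a single cell of the cross-tab
print(cube[("Blue", "all")])  # a row total (group by color)
print(cube[("all", "L")])     # a column total (group by size)
print(cube[("all", "all")])   # the grand total (empty group-by, φ)
```

The sketch rescans the rows once per group-by, which mirrors the "naive" approach of issuing one query per combination; a real system would share work across the lattice.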

DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)
    A: ROLAP/MOLAP
Q2: Which operations to allow
    A: roll-up, drill down, slice, dice
[More details: book by Han+Kamber]

DataCubes
Q1: How to store a DataCube?

  C / S    S    M    L  TOT
  Red     20    3    5   28
  Blue     3    3    8   14
  Gray     0    0    5    5
  TOT     23    6   18   47

A1: Relational (R-OLAP)

  Color  Size   count
  'all'  'all'  47
  Blue   'all'  14
  Blue   M      3
  …

A2: Multi-dimensional (M-OLAP)
A3: Hybrid (H-OLAP)
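
To make the R-OLAP / M-OLAP distinction concrete, here is a small Python sketch that stores the same cube both ways: as a dense array with a totals row and column (M-OLAP style) and as (color, size, count) tuples with 'all' marking totals (R-OLAP style). The function names `molap_array` and `rolap_rows` are hypothetical, and the sketch ignores indexing and sparsity; it only illustrates the two layouts.

```python
COLORS = ["Red", "Blue", "Gray"]
SIZES = ["S", "M", "L"]
# Base counts from the slides' cross-tab: BASE[i][j] belongs to COLORS[i], SIZES[j].
BASE = [[20, 3, 5],
        [3, 3, 8],
        [0, 0, 5]]


def molap_array(base):
    """M-OLAP style: dense array with an extra last column and row for totals."""
    grid = [row + [sum(row)] for row in base]        # append row totals
    grid.append([sum(col) for col in zip(*grid)])    # append column totals
    return grid


def rolap_rows(grid):
    """R-OLAP style: one (color, size, count) tuple per cell; 'all' marks a total."""
    colors = COLORS + ["all"]
    sizes = SIZES + ["all"]
    return [(c, s, grid[i][j])
            for i, c in enumerate(colors)
            for j, s in enumerate(sizes)]


grid = molap_array(BASE)
table = rolap_rows(grid)

# M-OLAP lookup is positional; R-OLAP lookup is a selection over rows.
print(grid[COLORS.index("Blue")][SIZES.index("M")])
print([cnt for c, s, cnt in table if c == "Blue" and s == "all"])
```

The dense array allocates a slot for every combination of values, which is why the slides caution about high dimensionality and sparseness for M-OLAP.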

DataCubes
Pros/Cons:
ROLAP strong points: (DSS, Metacube)
• use existing RDBMS technology
• scale up better with dimensionality

MOLAP strong points: (EssBase/hyperion.com)
• faster indexing
  (careful with: high-dimensionality; sparseness)

HOLAP: (MS SQL server OLAP services)
• detail data in ROLAP; summaries in MOLAP

DataCubes
Q1: How to store a DataCube
Q2: What operations should we support?

DataCubes
Q2: What operations should we support?
• Roll-up
• Drill-down
• Slice

  C / S    S    M    L  TOT
  Red     20    3    5   28
  Blue     3    3    8   14
  Gray     0    0    5    5
  TOT     23    6   18   47

Group-bys: φ; size; color; (color; size)
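
The slides name the operations without spelling them out. As a rough illustration, using the definitions commonly given in OLAP texts (an assumption here, not taken from the slides): roll-up aggregates a dimension away, drill-down goes back to the finer cells, slice fixes one dimension to a single value, and dice keeps a sub-cube restricted on several dimensions. The helper names below are hypothetical.

```python
# Base cells of the cube from the slides: (color, size) -> count.
CELLS = {
    ("Red", "S"): 20, ("Red", "M"): 3, ("Red", "L"): 5,
    ("Blue", "S"): 3, ("Blue", "M"): 3, ("Blue", "L"): 8,
    ("Gray", "S"): 0, ("Gray", "M"): 0, ("Gray", "L"): 5,
}
COLOR, SIZE = 0, 1  # positions of the two dimensions inside a cell key


def roll_up(cells, keep):
    """Aggregate away every dimension except the one at position `keep`."""
    totals = {}
    for key, count in cells.items():
        totals[key[keep]] = totals.get(key[keep], 0) + count
    return totals


def slice_cube(cells, dim, value):
    """Fix one dimension to a single value."""
    return {key: n for key, n in cells.items() if key[dim] == value}


def dice(cells, colors, sizes):
    """Keep the sub-cube whose coordinates fall inside the given value sets."""
    return {key: n for key, n in cells.items()
            if key[COLOR] in colors and key[SIZE] in sizes}


by_color = roll_up(CELLS, COLOR)            # the TOT column of the cross-tab
by_size = roll_up(CELLS, SIZE)              # the TOT row of the cross-tab
large_only = slice_cube(CELLS, SIZE, "L")   # the 'L' column
sub_cube = dice(CELLS, {"Red", "Blue"}, {"M", "L"})

print(by_color)
print(by_size)
print(large_only)
print(sub_cube)
```

Drill-down is the reverse of roll-up: starting from `by_color["Red"]`, it returns to the finer cells, which in this sketch is simply `slice_cube(CELLS, COLOR, "Red")`.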

DataCubes
Q2: What operations should we support?
Dice

  C / S    S    M    L  TOT
  Red     20    3    5   28
  Blue     3    3    8   14
  Gray     0    0    5    5
  TOT     23    6   18   47

Group-bys: φ; size; color; (color; size)

DataCubes
Q2: What operations should we support?
• Roll-up
• Drill-down
• Slice
• Dice
• (Pivot/rotate; drill-across; drill-through
• top N
• moving averages, etc)

D/W - OLAP - Conclusions
• D/W: copy (summarized) data + analyze
• OLAP - concepts:
  – DataCube
  – R/M/H-OLAP servers
  – 'dimensions'; 'measures'

Outline
• Problem
• Getting the data: Data Warehouses, DataCubes, OLAP
• Supervised learning: decision trees
• Unsupervised learning
  – association rules
  – (clustering)

Decision trees - Problem

  Age  Chol-level  Gender  …  CLASS-ID
  30   150         M          +
  …

Decision trees
• Pictorially, we have
[Figure: scatter plot of '+' and '-' points; x-axis: num. attr#1 (e.g., 'age'); y-axis: num. attr#2 (e.g., chol-level); one unlabeled point '??']

Decision trees
• and we want to label '?'
[Figure: the same scatter plot with the unlabeled point '?']

Decision trees
• so we build a decision tree:
[Figure: the same scatter plot, partitioned by a split at 50 on num. attr#1 (e.g., 'age') and a split at 40 on num. attr#2 (e.g., chol-level)]

Decision trees
• so we build a decision tree:
[Figure: tree with root test age<50 and Y/N branches; one branch leads to the test chol.<40, the other to a '+' leaf; below chol.<40, the Y/N branches lead to a '-' leaf and to '...']

Outline
• Problem
• Getting the data: Data Warehouses, DataCubes, OLAP
• Supervised learning: decision trees
  – problem
  – approach
  – scalability enhancements
• Unsupervised learning
  – association rules
  – (clustering)

Decision trees
• Typically, two steps:
  – tree building
  – tree pruning (for over-training/over-fitting)

Tree building
• How?
[Figure: scatter plot of '+' and '-' points; x-axis: num. attr#1 (e.g., 'age'); y-axis: num. attr#2 (e.g., chol-level)]

Tree building
• How?
• A: Partition, recursively - pseudocode:

    Partition(Dataset S)
        if all points in S have same label then return
        evaluate splits along each attribute A
        pick best split, to divide S into S1 and S2
        Partition(S1); Partition(S2)

[Figure: scatter plot of '+' and '-' points]

Tree building
• Q1: how to introduce splits along attribute A_i
• Q2: how to evaluate a split?

Tree building
• Q1: how to introduce splits along attribute A_i
• A1:
  – for num. attributes:
    • binary split, or
    • multiple split
  – for categorical attributes:
    • compute all subsets (expensive!), or
    • use a greedy algo
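
The Partition pseudocode above can be sketched in Python roughly as follows. This version handles numeric attributes only and uses binary splits of the form "attribute < threshold". The slides leave Q2 (how to evaluate a split) open at this point, so the Gini impurity used here is an assumption chosen for illustration, as are the function names and the toy (age, chol-level) data, which are made up and not taken from the course.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(points, labels):
    """Try every binary split 'attr < threshold'; return (score, attr, threshold) or None."""
    best = None
    n = len(points)
    for a in range(len(points[0])):
        values = sorted({p[a] for p in points})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between two adjacent distinct values
            left = [y for p, y in zip(points, labels) if p[a] < t]
            right = [y for p, y in zip(points, labels) if p[a] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def partition(points, labels):
    """Recursive Partition(S): returns the tree as nested dicts."""
    if len(set(labels)) == 1:            # all points share one label: leaf
        return {"label": labels[0]}
    split = best_split(points, labels)
    if split is None:                    # identical points, mixed labels: majority leaf
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, a, t = split
    left = [(p, y) for p, y in zip(points, labels) if p[a] < t]
    right = [(p, y) for p, y in zip(points, labels) if p[a] >= t]
    return {"attr": a, "threshold": t,
            "left": partition(*zip(*left)),
            "right": partition(*zip(*right))}


def classify(tree, point):
    """Follow the splits from the root down to a leaf."""
    while "label" not in tree:
        branch = "left" if point[tree["attr"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]


# Hypothetical training data: (age, chol-level) with class labels.
points = [(30, 20), (45, 30), (35, 60), (40, 70), (60, 25), (65, 50), (70, 65)]
labels = ["-", "-", "+", "+", "+", "+", "+"]

tree = partition(points, labels)
print(tree)
print([classify(tree, p) for p in points])
```

Because the tree is grown until every leaf is pure, it reproduces the training labels exactly, which is the over-fitting behaviour that the pruning step mentioned in the slides is meant to counter.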