Midterm Review Jan-Willem van de Meent Review: Frequent Itemsets - PowerPoint PPT Presentation

Unsupervised Machine Learning and Data Mining
DS 5230 / DS 4420 - Fall 2018
Midterm Review
Jan-Willem van de Meent

Review: Frequent Itemsets

Frequent Itemsets
• Items = {milk, coke, pepsi, beer, juice}
• Baskets
  B1 = {m, c, b}
  B2 = {m, p, j}
  B3 = {m, b}
  B4 = {c, j}
  B5 = {m, c, b}
  B6 = {m, c, b, j}
  B7 = {c, b, j}
  B8 = {b, c}
• Frequent itemsets (σ(X) ≥ 3): {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3

Example: Confidence and Interest
Baskets: B1 = {m, c, b}, B2 = {m, p, j}, B3 = {m, b}, B4 = {c, j}, B5 = {m, c, b}, B6 = {m, c, b, j}, B7 = {c, b, j}, B8 = {b, c}
• Lift = c(A → B) / s(B)
• Association rule: {m} → b
• Confidence = σ({m, b}) / σ({m}) = 4/5
• Item b appears in 6/8 of the baskets, so s(b) = 6/8
• Interest Factor (lift) = (4/5) / (6/8) = 16/15 ≈ 1.07
• Rule is not very interesting! A lift close to 1 means m tells us little about b beyond how common b already is.
adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
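
The confidence and lift of the rule {m} → b can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and s(B) is assumed to be the fraction of baskets containing B, as in the "6/8 of the baskets" bullet.

```python
# The eight example baskets from the slide.
baskets = [
    {"m", "c", "b"},
    {"m", "p", "j"},
    {"m", "b"},
    {"c", "j"},
    {"m", "c", "b"},
    {"m", "c", "b", "j"},
    {"c", "b", "j"},
    {"b", "c"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(antecedent, consequent, baskets):
    """c(A -> B) = sigma(A union B) / sigma(A)."""
    both = support_count(antecedent | consequent, baskets)
    return both / support_count(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Lift = c(A -> B) / s(B), with s(B) a fraction of all baskets (assumption)."""
    s_b = support_count(consequent, baskets) / len(baskets)
    return confidence(antecedent, consequent, baskets) / s_b


conf_mb = confidence({"m"}, {"b"}, baskets)
lift_mb = lift({"m"}, {"b"}, baskets)
print(conf_mb, lift_mb)
```

With confidence 4/5 and s(b) = 6/8, the lift is 16/15, only slightly above 1.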

Apriori – Overview
Pipeline: C1 (all items) → count the items → filter → L1 → construct → C2 (all pairs of items from L1) → count the pairs → filter → L2 → construct → C3 (all pairs of sets that differ by 1 element)
1. Set k = 0
2. Define C_1 as all size-1 itemsets
3. While C_{k+1} is not empty
4.    Set k = k + 1
5.    Scan DB to determine subset L_k ⊆ C_k with support ≥ s (I/O limited)
6.    Construct candidates C_{k+1} by combining sets in L_k that differ by 1 element (memory limited)
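
The loop above can be sketched in Python as follows. This is an illustrative in-memory version, not a tuned implementation: the function name `apriori` and the dictionary return type are assumptions, and the "scan DB" step is simply a pass over a Python list.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    # C_1: all size-1 itemsets.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # Scan the data once to count every candidate in C_k.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        # Filter: L_k keeps the candidates with enough support.
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Construct C_{k+1}: unions of two sets in L_k that differ by one element.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        k += 1
    return frequent


baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the eight example baskets with a support threshold of 3, this yields the same nine frequent itemsets listed on the Frequent Itemsets slide.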

FP-Growth – Intuition
• Apriori requires one pass for each k
• Can we find all frequent itemsets in fewer passes over the data?
FP-Growth Algorithm:
• Pass 1: Count items with support ≥ s
• Sort frequent items in descending order according to count
• Pass 2: Store the frequent items of each basket, in that order, in a frequent pattern tree (FP-tree)
• Mine patterns from the FP-tree
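
The two passes can be sketched as follows. This covers tree construction only, not the mining step. The class `FPNode`, the function `build_fp_tree`, and the alphabetical tie-break between equally frequent items are assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, a count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(baskets, min_support):
    # Pass 1: count items and keep those with support >= min_support.
    counts = Counter(item for basket in baskets for item in basket)
    frequent = {item: n for item, n in counts.items() if n >= min_support}

    def order(item):
        # Descending count; ties broken alphabetically (an arbitrary choice).
        return (-frequent[item], item)

    # Pass 2: insert each basket's frequent items, in sorted order, as a path.
    root = FPNode(None)
    for basket in baskets:
        node = root
        for item in sorted((i for i in basket if i in frequent), key=order):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
root, frequent_items = build_fp_tree(baskets, min_support=3)
print({item: child.count for item, child in root.children.items()})
```

Because baskets that share a prefix of frequent items share a path, the tree is often much smaller than the raw data, which is the "compact" property the next slide refers to.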

FP-Growth vs Apriori
Advantages of FP-Growth
• Only 2 passes over dataset
• Stores “compact” version of dataset
• No candidate generation
• Faster than Apriori
Disadvantages of FP-Growth
• The FP-tree may not be “compact” enough to fit in memory
• Used in practice: PFP (a distributed version of FP-Growth)

Review: ML Basics

What is Similarity? Can be hard to define, but we know it when we see it.

Similarity Metrics in Machine Learning
• Regression: Similar points x and x’ should have similar function values f(x) and f(x’)
• Dimensionality Reduction: Reduce dimension of points x and x’ in a manner that preserves similarities
• Clustering: Similar points x and x’ should have the same cluster assignments z and z’

Distance Norms
Euclidean Distance: $\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$
Manhattan Distance: $\sum_{i=1}^{k} |x_i - y_i|$
Minkowski Distance: $\left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$
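
The three distances are the same formula with different exponents, which a short sketch makes concrete. The function name `minkowski_distance` and the sample points are illustrative only.

```python
def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


x = [0.0, 0.0]
y = [3.0, 4.0]
manhattan = minkowski_distance(x, y, 1)  # q = 1 gives the Manhattan distance
euclidean = minkowski_distance(x, y, 2)  # q = 2 gives the Euclidean distance
print(manhattan, euclidean)
```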

Sensitivity to Scaling

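Normalization Strategies
Min-Max: $\frac{X - X_{\min}}{X_{\max} - X_{\min}}$
Min-Max to a range [a, b]: $a + \frac{(X - X_{\min})(b - a)}{X_{\max} - X_{\min}}$
Z-score, for $X \sim \mathcal{N}(\mu, \sigma)$: $\frac{X - \mu}{\sigma}$
Scaling: $\frac{X}{X_{\max}}$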
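
A minimal sketch of the three strategies on a single feature column, using only the standard library. The function names are illustrative, and the code assumes the column is not constant (so the denominators are nonzero) and, for `max_scale`, that the maximum is nonzero.

```python
from statistics import mean, pstdev


def min_max(values, a=0.0, b=1.0):
    """Rescale values linearly so the minimum maps to a and the maximum to b."""
    lo, hi = min(values), max(values)
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def max_scale(values):
    """Divide every value by the maximum."""
    hi = max(values)
    return [v / hi for v in values]


data = [2.0, 4.0, 6.0, 8.0]
unit = min_max(data)               # maps onto [0, 1]
symmetric = min_max(data, -1, 1)   # maps onto [-1, 1]
standardized = z_score(data)       # zero mean, unit variance
scaled = max_scale(data)
print(unit, symmetric, standardized, scaled)
```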

Curse of Dimensionality
[Figure: Distance plotted against Fraction of Volume, with curves for D = 1, 2, 3, and 10]
Implication: Estimating similarities is difficult for high-dimensional data
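
One common way to illustrate this effect, and possibly what the figure plots (an assumption, since only the axis labels survive), is the edge length of a sub-cube needed to capture a fraction r of the volume of a unit hypercube in D dimensions, which is r^(1/D). The function name `edge_length` is illustrative.

```python
def edge_length(fraction, dims):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dims)


# To capture 10% of the volume, the required edge grows quickly with dimension.
for dims in (1, 2, 3, 10):
    print(dims, round(edge_length(0.1, dims), 3))
```

In 10 dimensions, covering 10% of the volume already requires about 80% of the range of every coordinate, so a "local" neighborhood is no longer local.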

Review: Probability

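Bayes' Rule
$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$, where $p(y \mid x)$ is the posterior, $p(x \mid y)$ the likelihood, and $p(y)$ the prior
Sum Rule: $p(x) = \sum_{y} p(x, y)$
Product Rule: $p(x, y) = p(x \mid y)\, p(y)$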

Expected Values
X is a random variable with density p(x)
Notation differs between Machine Learning and Statistics: the distribution is either implied by X or explicitly defined.
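
For reference, the standard definition of the expected value of a function f of a continuous random variable X with density p(x) is given below. The slide's own formula did not survive extraction, so the exact notation here is an assumption; the subscripted form is one common way of stating the distribution explicitly.

```latex
\mathbb{E}[f(X)] = \int f(x)\, p(x)\, dx
\qquad \text{or, with the distribution made explicit,} \qquad
\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx
```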
