Unsupervised Machine Learning and Data Mining
DS 5230 / DS 4420 - Fall 2018
Midterm Review
Jan-Willem van de Meent
Review: Frequent Itemsets
Frequent Itemsets
• Items = {milk, coke, pepsi, beer, juice}
• Baskets:
  B1 = {m, c, b}    B2 = {m, p, j}       B3 = {m, b}       B4 = {c, j}
  B5 = {m, c, b}    B6 = {m, c, b, j}    B7 = {c, b, j}    B8 = {b, c}
• Frequent itemsets (σ(X) ≥ 3):
  {m}: 5, {c}: 6, {b}: 6, {j}: 4, {m,c}: 3, {m,b}: 4, {c,b}: 5, {c,j}: 3, {m,c,b}: 3
Example: Confidence and Interest
  B1 = {m, c, b}    B2 = {m, p, j}       B3 = {m, b}       B4 = {c, j}
  B5 = {m, c, b}    B6 = {m, c, b, j}    B7 = {c, b, j}    B8 = {b, c}
• Lift(A → B) = c(A → B) / s(B), where s(B) is the fraction of baskets containing B
• Association rule: {m} → b
• Confidence = 4/5
• Item b appears in 6/8 of the baskets
• Interest factor (lift) = (4/5) / (6/8) = 16/15 ≈ 1.07
• A lift this close to 1 means the rule is not very interesting!
adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
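The following is a small sketch (my own code, not from the slides) that recomputes the support counts from the previous slide and the confidence and lift for the rule {m} → b, assuming the lift definition c(A → B) / s(B) above.

```python
# Illustrative sketch: support counts, confidence, and lift for the example baskets.
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support_count(itemset):
    """Number of baskets containing every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

# Frequent itemsets with support count >= 3 (matches the previous slide).
items = sorted(set().union(*baskets))
for size in (1, 2, 3):
    for itemset in combinations(items, size):
        count = support_count(itemset)
        if count >= 3:
            print(set(itemset), count)

# Rule {m} -> b: confidence and lift.
conf = support_count({"m", "b"}) / support_count({"m"})   # 4/5
s_b = support_count({"b"}) / len(baskets)                  # 6/8
print("confidence:", conf, "lift:", conf / s_b)            # lift = 16/15 ~= 1.07
```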
Apriori – Overview
[Diagram: C1 (all items) → count the items → filter → L1 → construct (all pairs of items from L1) → C2 → count the pairs → filter → L2 → construct (all pairs of sets that differ by 1 element) → C3 → ...]
1. Set k = 0
2. Define C1 as all size-1 itemsets
3. While Ck+1 is not empty:
4.   Set k = k + 1
5.   Scan DB to determine the subset Lk ⊆ Ck with support ≥ s (I/O limited)
6.   Construct candidates Ck+1 by combining sets in Lk that differ by 1 element (memory limited)
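Below is a minimal sketch of this loop (an illustration under my own naming, not the course implementation); `baskets` is a list of item sets and `s` is a support-count threshold.

```python
# Minimal Apriori sketch following the loop on this slide.
def apriori(baskets, s):
    # C1: all singleton itemsets that occur in the data
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    k, frequent = 0, {}
    while candidates:                       # while C_{k+1} is not empty
        k += 1
        # Scan the DB to find L_k: candidates in C_k with support >= s (I/O limited)
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= s}
        frequent.update(level)
        # Construct C_{k+1} by combining sets in L_k that differ
        # by one element (memory limited)
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    return frequent

# e.g. apriori(baskets, 3) reproduces the frequent itemsets from the earlier example
```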
FP-Growth – Intuition • Apriori requires one pass for each k • Can we find all frequent item sets in fewer passes over the data? FP-Growth Algorithm : • Pass 1 : Count items with support ≥ s • Sort frequent items in descending order according to count • Pass 2 : Store all frequent itemsets in a frequent pattern tree (FP-tree) • Mine patterns from FP-Tree
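A minimal sketch of the two construction passes (my own illustration; mining patterns from the tree via conditional pattern bases is omitted here):

```python
# Sketch of FP-tree construction: pass 1 counts items, pass 2 inserts each
# basket's frequent items, sorted by descending count, into a prefix tree.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(baskets, s):
    # Pass 1: count items, keep those with support >= s
    counts = Counter(item for basket in baskets for item in basket)
    frequent = {item for item, n in counts.items() if n >= s}

    # Pass 2: insert baskets into the tree, most frequent items first
    root = FPNode(None, None)
    for basket in baskets:
        items = sorted((i for i in basket if i in frequent),
                       key=lambda i: -counts[i])
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root
```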
FP-Growth vs Apriori
Advantages of FP-Growth
• Only 2 passes over the dataset
• Stores a “compact” version of the dataset
• No candidate generation
• Faster than Apriori
Disadvantages of FP-Growth
• The FP-tree may not be “compact” enough to fit in memory
• Used in practice: PFP (a distributed version of FP-Growth)
Review: ML Basics
What is Similarity? Can be hard to define, but we know it when we see it.
Similarity Metrics in Machine Learning
• Regression: similar points x and x’ should have similar function values f(x) and f(x’)
• Dimensionality Reduction: reduce the dimension of points x and x’ in a manner that preserves similarities
• Clustering: similar points x and x’ should have the same cluster assignments z and z’
Distance Norms
• Euclidean distance: $\left( \sum_{i=1}^{k} (x_i - y_i)^2 \right)^{1/2}$
• Manhattan distance: $\sum_{i=1}^{k} |x_i - y_i|$
• Minkowski distance: $\left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$
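A small sketch of these norms in NumPy (example values are my own):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, y, 1))       # Manhattan: 3.0
print(minkowski(x, y, 2))       # Euclidean: sqrt(5) ~= 2.236
print(np.linalg.norm(x - y))    # same Euclidean value via NumPy
```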
Sensitivity to Scaling
Normalization Strategies
• Min-Max: $X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$, or to a range $[a, b]$: $X' = a + \frac{(X - X_{\min})(b - a)}{X_{\max} - X_{\min}}$
• Z-score, for $X \sim \mathcal{N}(\mu, \sigma)$: $X' = \frac{X - \mu}{\sigma}$
• Scaling: $X' = \frac{X}{X_{\max}}$
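A minimal NumPy sketch of the three strategies, applied column-wise to a small example matrix (values are my own):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 100.0]])

# Min-max to [0, 1], and to an arbitrary range [a, b]
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
a, b = -1.0, 1.0
min_max_ab = a + (X - X.min(axis=0)) * (b - a) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization
z_score = (X - X.mean(axis=0)) / X.std(axis=0)

# Scaling by the column maximum
scaled = X / X.max(axis=0)
```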
Curse of Dimensionality
[Figure: distance that must be covered to capture a given fraction of the volume of a unit cube, shown for dimensions p = 1, 2, 3, 10; axes: fraction of volume (x) vs. distance (y)]
Implication: Estimating similarities is difficult for high-dimensional data
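One way to see this numerically (my own toy experiment, not from the slides): as the dimension grows, pairwise distances between random points concentrate, so the nearest and farthest neighbors become nearly indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances to the first point
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d = {d:4d}   relative spread of distances: {spread:.2f}")
```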
Review: Probability
Bayes' Rule
$\underbrace{p(z \mid x)}_{\text{Posterior}} = \frac{\overbrace{p(x \mid z)}^{\text{Likelihood}} \; \overbrace{p(z)}^{\text{Prior}}}{p(x)}$
Sum Rule: $p(x) = \sum_{z} p(x, z)$
Product Rule: $p(x, z) = p(x \mid z)\, p(z)$
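A tiny discrete example (the numbers are my own) applying the product rule, sum rule, and Bayes' rule for a binary latent variable z and an observation x = 1:

```python
p_z = {0: 0.7, 1: 0.3}            # prior p(z)
p_x1_given_z = {0: 0.1, 1: 0.8}   # likelihood p(x = 1 | z)

# Product rule: p(x = 1, z) = p(x = 1 | z) p(z)
joint = {z: p_x1_given_z[z] * p_z[z] for z in p_z}

# Sum rule: p(x = 1) = sum_z p(x = 1, z)
evidence = sum(joint.values())

# Bayes' rule: p(z | x = 1) = p(x = 1 | z) p(z) / p(x = 1)
posterior = {z: joint[z] / evidence for z in p_z}
print(posterior)                  # {0: ~0.226, 1: ~0.774}
```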
Expected Values
X is a random variable with density p(x):
$\mathbb{E}[f(X)] = \int f(x)\, p(x)\, dx$
Notation: statistics typically writes $\mathbb{E}[f(X)]$ (the distribution is implied by X), while machine learning often writes $\mathbb{E}_{p(x)}[f(x)]$ (the distribution is defined explicitly).
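A quick numeric check (my own example) that a Monte Carlo average of f(x) under p(x) approaches $\mathbb{E}_{p(x)}[f(x)]$:

```python
import numpy as np

# For p(x) = N(0, 1) and f(x) = x^2, the exact expectation is 1.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.mean(samples ** 2))   # ~= 1.0
```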