Case study: d60 Raptor smartAdvisor (Jan Neerbek, Alexandra Institute)


  1. Case study: d60 Raptor smartAdvisor · Jan Neerbek · Alexandra Institute

  2. Agenda · d60: A cloud/data mining case · Cloud · Data Mining · Market Basket Analysis · Large data sets · Our solution

  3. Alexandra Institute · The Alexandra Institute is a non-profit company that works with application-oriented IT research. Our focus is pervasive computing, and we activate the business potential of our members and customers through research-based, user-driven innovation.

  4. The case: d60 · Danish company · A similar-products recommendation engine · d60 was outgrowing its servers (late 2010) · They saw potential in moving to Azure

  5. The setup · Product recommendations · Internet · Webshops · Log shopping patterns · Do data mining

  6. The cloud potential · Elasticity · No upfront server cost · Cheaper licenses · Faster calculations

  7. Challenges · No SQL Server Analysis Services (SSAS) · Small compute nodes · Partitioned database (50 GB) · SQL Server ingress/egress access is slow

  8. The cloud · [Diagram: seven compute nodes]

  9. The cloud and services · [Diagram: seven nodes, a data layer service, and a messaging service]

  10. Data layer service · Application-specific (schema/layout) · SQL, table or other · Easily becomes a bottleneck · Can be difficult to scale

  11. Messaging service · Task queues · Standard data structure · Built-in ordering (FIFO) · Can be scaled · Good for asynchronous messages

  13. Data mining · Data mining is the use of automated data analysis techniques to uncover relationships among data items. · Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals. [about.com/wikipedia.org]

  14. Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers

  15. Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers · Itemset (Diapers, Beer) occurs in 50% of the baskets · Frequency threshold parameter · Find as many frequent itemsets as possible
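
The 50% figure for (Diapers, Beer) can be checked with a few lines of Python. This is a minimal brute-force sketch, not the method used in the talk; the names `support` and `frequent_pairs` are made up for illustration, and the baskets are the four customers from the slide.

```python
from itertools import combinations

# The four baskets from the slide.
baskets = {
    "Customer1": {"Avocado", "Milk", "Butter", "Potatoes"},
    "Customer2": {"Milk", "Diapers", "Avocado", "Beer"},
    "Customer3": {"Beef", "Lemons", "Beer", "Chips"},
    "Customer4": {"Cereal", "Beer", "Beef", "Diapers"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def frequent_pairs(baskets, threshold):
    """Brute force: every item pair whose support meets the threshold."""
    items = sorted(set().union(*baskets.values()))
    found = {}
    for pair in combinations(items, 2):
        s = support(pair, baskets)
        if s >= threshold:
            found[pair] = s
    return found


print(support({"Diapers", "Beer"}, baskets))  # 0.5
print(frequent_pairs(baskets, 0.5))
```

Trying every pair (and then every triple, and so on) grows combinatorially with the number of items, which is why a dedicated algorithm such as FP-growth is used on large data sets.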

  16. Market basket analysis · Popular effective algorithm: FP-growth · Based on the FP-tree data structure · Requires all data in near-memory · Most research in distributed models has been for cluster setups

  17. Building the FP-tree (extends the prefix-tree structure) · Customer1: Avocado, Milk, Butter, Potatoes · Tree: Avocado → Butter → Milk → Potatoes

  18. Building the FP-tree · Customer2: Avocado, Milk, Diapers, Beer · Tree so far: Avocado → Butter → Milk → Potatoes

  19. Building the FP-tree · Customer2: Avocado, Milk, Diapers, Beer · Tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk

  21. Building the FP-tree · Final tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk; Beef → Beer → Chips → Lemon; Beef → Beer → Cereal → Diapers
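
The tree-building steps can be sketched as a counted prefix tree. This is an illustrative sketch only: `Node`, `build_tree` and `paths` are hypothetical names, and items are inserted in alphabetical order because that is what the slide pictures appear to show. The classic FP-tree orders items by descending frequency instead and also keeps header links between nodes of the same item, which are left out here.

```python
class Node:
    """One tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_tree(transactions):
    """Insert each transaction as a path, sharing common prefixes."""
    root = Node()
    for transaction in transactions:
        node = root
        for item in sorted(transaction):  # assumption: alphabetical item order
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def paths(node, prefix=()):
    """List every root-to-leaf path as (items, leaf count)."""
    if not node.children:
        return [(prefix, node.count)]
    found = []
    for item in sorted(node.children):
        found.extend(paths(node.children[item], prefix + (item,)))
    return found


transactions = [
    ["Avocado", "Milk", "Butter", "Potatoes"],  # Customer1
    ["Milk", "Diapers", "Avocado", "Beer"],     # Customer2
    ["Beef", "Lemons", "Beer", "Chips"],        # Customer3
    ["Cereal", "Beer", "Beef", "Diapers"],      # Customer4
]
tree = build_tree(transactions)
for items, count in paths(tree):
    print(" -> ".join(items), count)
```

Customer1 and Customer2 share the Avocado node, and Customer3 and Customer4 share the Beef and Beer nodes, so those nodes end up with a count of 2.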

  22. FP-growth · Grows the frequent itemsets recursively · FP-growth(FP-tree tree) { … for-each (item in tree) { count = CountOccur(tree, item); if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); FP-growth(sub); } } }

  23. FP-growth algorithm · Divide and conquer · Traverse tree · [Figure: the full FP-tree]

  24. FP-growth algorithm · Divide and conquer · Generate sub-trees · [Figure: the full FP-tree]

  25. FP-growth algorithm · Divide and conquer · Call recursively · [Figure: the full FP-tree and a sub-tree with nodes Avocado, Butter, Beer, Diapers]
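
The divide-and-conquer recursion can be sketched in runnable form. This is a simplified illustration, not the implementation from the talk: it recurses over plain lists of transactions (a "conditional database") instead of a compressed FP-tree, and `fp_growth` and `min_count` are names chosen for this example. Items are ordered alphabetically, and each recursive call only looks at items that sort before the current one, so every itemset is produced once.

```python
def fp_growth(transactions, min_count, suffix=()):
    """Yield (itemset, count) for every frequent itemset.

    Simplification: works on transaction lists rather than an FP-tree.
    """
    # Count how often each item occurs in this part of the data.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1

    for item in sorted(counts):
        if counts[item] < min_count:
            continue
        itemset = (item,) + suffix
        yield itemset, counts[item]
        # Sub-problem: transactions containing `item`, restricted to the
        # items that sort before it (plays the role of the sub-tree).
        sub = [tuple(i for i in t if i < item)
               for t in transactions if item in t]
        yield from fp_growth(sub, min_count, itemset)


transactions = [
    ("Avocado", "Milk", "Butter", "Potatoes"),  # Customer1
    ("Milk", "Diapers", "Avocado", "Beer"),     # Customer2
    ("Beef", "Lemons", "Beer", "Chips"),        # Customer3
    ("Cereal", "Beer", "Beef", "Diapers"),      # Customer4
]
frequent = dict(fp_growth(transactions, min_count=2))
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

With a threshold of two baskets (50%), the result contains five single items and the pairs (Avocado, Milk), (Beef, Beer) and (Beer, Diapers).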

  26. FP-growth algorithm · Memory usage · The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory

  27. Distributed Shared Memory? · [Figure: five CPUs, each with its own memory, joined over a network into one shared memory] · To add nodes is to add memory · Works best in tightly coupled setups with low-latency, high-speed networks

  28. FP-growth algorithm · Memory usage · The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory · Optimize your data structures · Buy more RAM · Get a good idea

  29. Get a good idea · Database scans are serial and can be distributed · The list of items used in the recursive calls uniquely determines what part of the data we are looking at

  30. Get a good idea · [Figure: the full FP-tree and a sub-tree with nodes Avocado, Butter, Beer, Diapers]

  31. Get a good idea · Postfix (Butter, Milk): Avocado · Postfix (Milk): Avocado → Butter and Avocado → Beer → Diapers · Postfix (Diapers, Milk): Avocado → Beer · These are postfix paths
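
The claim that a postfix alone identifies a slice of the data can be illustrated as follows. This is a sketch under assumptions: `project` is a hypothetical helper, items are ordered alphabetically as in the tree pictures, and the data is the four example baskets. Given only a postfix, any node can rebuild the matching part of the data from the raw transactions, with no shared tree in memory.

```python
def project(transactions, postfix):
    """Return the part of the data that a given postfix refers to.

    Keeps the transactions that contain every postfix item and trims each
    one to the items that sort before the postfix.
    """
    needed = set(postfix)
    limit = min(postfix) if postfix else None
    projected = []
    for t in transactions:
        if needed <= set(t):
            projected.append(tuple(i for i in sorted(t)
                                   if limit is None or i < limit))
    return projected


transactions = [
    ("Avocado", "Milk", "Butter", "Potatoes"),  # Customer1
    ("Milk", "Diapers", "Avocado", "Beer"),     # Customer2
    ("Beef", "Lemons", "Beer", "Chips"),        # Customer3
    ("Cereal", "Beer", "Beef", "Diapers"),      # Customer4
]

print(project(transactions, ("Milk",)))
print(project(transactions, ("Diapers", "Milk")))
print(project(transactions, ("Butter", "Milk")))
```

Because the function needs nothing but the postfix and the transactions, the postfix is small enough to send as a message while the transactions stay where they are stored.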

  33. Buckets · Use postfix paths for messaging · Working with buckets · [Figure axes: Transactions, Items]

  34. FP-growth revisited · FP-growth(FP-tree tree /* replaced with postfix */) { … for-each (item in tree) /* done in parallel */ { count = CountOccur(tree, item); /* done in parallel */ if (IsFrequent(count)) { OutputSet(item); sub = tree.GetTree(tree, item); /* done in parallel */ FP-growth(sub); } } }

  35. Communication · [Diagram: four nodes and the data layer]

  36. Revised communication · [Diagram: four nodes, a message queue (MQ), and the data layer]

  37. Running FP-growth · Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursive · Standard FP-growth

  39. Collecting what we have learned · Message-driven work, using a message queue · Peer-to-peer for intermediate results · Distribute data for scalability (buckets) · Small messages (list of items) · Allows us to distribute FP-growth
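
The message-driven scheme summarized above can be simulated in a single process. This is a rough sketch under stated assumptions, not the d60 implementation: a `deque` stands in for the message queue, a loop over buckets stands in for parallel workers, and `make_buckets`, `count_bucket` and `message_driven_fp_growth` are invented names. It also skips the step where small sub-problems are handed to standard FP-growth.

```python
from collections import Counter, deque


def make_buckets(transactions, n):
    """Split the transactions round-robin into n buckets."""
    return [transactions[i::n] for i in range(n)]


def count_bucket(bucket, postfix):
    """Worker step: count candidate items in one bucket for one postfix."""
    needed = set(postfix)
    limit = min(postfix) if postfix else None
    counts = Counter()
    for t in bucket:
        if needed <= set(t):
            counts.update(i for i in t if limit is None or i < limit)
    return counts


def message_driven_fp_growth(transactions, min_count, n_buckets=2):
    buckets = make_buckets(transactions, n_buckets)
    queue = deque([()])  # FIFO queue of postfix messages; () is the whole data set
    result = {}
    while queue:
        postfix = queue.popleft()
        # In a real deployment each bucket would be counted on its own node.
        total = Counter()
        for bucket in buckets:
            total.update(count_bucket(bucket, postfix))
        # Collect the counts for this postfix and enqueue the follow-up work.
        for item, count in total.items():
            if count >= min_count:
                itemset = (item,) + postfix
                result[itemset] = count
                queue.append(itemset)  # small message: just a list of items
    return result


transactions = [
    ("Avocado", "Milk", "Butter", "Potatoes"),  # Customer1
    ("Milk", "Diapers", "Avocado", "Beer"),     # Customer2
    ("Beef", "Lemons", "Beer", "Chips"),        # Customer3
    ("Cereal", "Beer", "Beef", "Diapers"),      # Customer4
]
print(message_driven_fp_growth(transactions, min_count=2))
```

The result does not depend on how many buckets the data is split into, which is what makes the work sizes configurable.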

  40. Advantages · Configurable work sizes · Good distribution of work · Robust against computer failure · Fast!

  41. So what about performance? · [Chart labels: Message-driven FP-growth, FP-growth, Total node time; time axis from 00:00:00 to 04:30:00; x-axis ticks 1, 2, 4, 8]

  42. Thank you!
