Case study: d60 Raptor smartAdvisor Jan Neerbek Alexandra Institute
Agenda · d60: A cloud/data mining case · Cloud · Data Mining · Market Basket Analysis · Large data sets · Our solution 2
Alexandra Institute The Alexandra Institute is a non-profit company that works with application-oriented IT research. The focus is pervasive computing, and we activate the business potential of our members and customers through research-based, user-driven innovation. 3
The case: d60 · Danish company · A similar products recommendation engine · d60 was outgrowing their servers (late 2010) · They saw a potential in moving to Azure 4
The setup Product Recommendations Internet Webshops Log shopping patterns Do data mining 5
The cloud potential · Elasticity · No upfront server cost · Cheaper licenses · Faster calculations 6
Challenges · No SQL Server Analysis Services (SSAS) · Small compute nodes · Partitioned database (50 GB) · SQL Server ingress/egress access is slow 7
The cloud [Diagram: seven compute nodes] 8
The cloud and services [Diagram: seven compute nodes, a data layer service and a messaging service] 9
Data layer service · Application specific (schema/layout) · SQL, table or other · Easily becomes a bottleneck · Can be difficult to scale 10
Messaging service · Task queues · Standard data structure · Built-in ordering (FIFO) · Can be scaled · Good for asynchronous messages 11
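The FIFO behaviour of a task queue can be illustrated with a minimal sketch. It uses Python's in-process `queue.Queue` as a stand-in for a cloud messaging service; the message texts and the `worker` function are hypothetical and not taken from the d60 solution.

```python
import queue
import threading

tasks = queue.Queue()   # FIFO: messages come out in the order they were put in
processed = []          # what the worker has handled, in handling order


def worker():
    """Consume messages until the None sentinel arrives."""
    while True:
        message = tasks.get()
        if message is None:
            tasks.task_done()
            break
        processed.append(message.upper())  # placeholder for real work
        tasks.task_done()


thread = threading.Thread(target=worker)
thread.start()

# The producer does not wait for the worker: the messages are asynchronous.
for text in ["count bucket 1", "count bucket 2", "count bucket 3"]:
    tasks.put(text)
tasks.put(None)   # sentinel telling the worker to stop

tasks.join()      # block until every message has been marked done
thread.join()
print(processed)
```

With a single worker the handling order equals the sending order. With several workers the queue still hands out messages in FIFO order, but completion order may differ.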
Data mining Data mining is the use of automated data analysis techniques to uncover relationships among data items. Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals. [about.com/wikipedia.org] 13
Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers 14
Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers · Itemset (Diapers, Beer) occurs in 50% of the baskets · Frequency threshold parameter · Find as many frequent itemsets as possible 15
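The 50% figure can be checked with a small sketch. The basket contents are taken from the customer table above; the function names `support` and `is_frequent` are illustrative choices, not part of the original solution.

```python
# Baskets from the slide's table, one set of items per customer.
baskets = {
    "Customer1": {"Avocado", "Milk", "Butter", "Potatoes"},
    "Customer2": {"Milk", "Diapers", "Avocado", "Beer"},
    "Customer3": {"Beef", "Lemons", "Beer", "Chips"},
    "Customer4": {"Cereal", "Beer", "Beef", "Diapers"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for items in baskets.values() if wanted <= items)
    return hits / len(baskets)


def is_frequent(itemset, baskets, threshold):
    """An itemset is frequent when its support reaches the threshold."""
    return support(itemset, baskets) >= threshold


print(support({"Diapers", "Beer"}, baskets))  # 0.5 (Customer2 and Customer4)
```

The frequency threshold is the only tuning parameter: lowering it yields more frequent itemsets but also more work.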
Market basket analysis · Popular effective algorithm: FP-growth · Based on data structure FP-tree · Requires all data in near-memory · Most research in distributed models has been for cluster setups 16
Building the FP-tree (extends the prefix-tree structure) Customer1 Avocado Avocado Milk Butter Butter Potatoes Milk Potatoes 17
Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Milk Potatoes 18
Building the FP-tree Customer2 Avocado Milk Diapers Avocado Butter Beer Beer Milk Diapers Potatoes Milk 19
Building the FP-tree Beef Avocado Butter Beer Beer Milk Diapers Chips Cereal Potatoes Milk Lemon Diapers 21
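The tree construction shown on the slides can be sketched as a prefix tree with counts. This is a simplified illustration under stated assumptions: items within a basket are sorted alphabetically (which appears to match the slides), whereas standard FP-tree construction usually orders items by descending frequency and also keeps header links between equal items. The names `Node` and `build_tree` are hypothetical.

```python
class Node:
    """One tree node: an item, how many baskets pass through it, and its children."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_tree(transactions):
    """Insert each basket as a path from the root, sharing common prefixes."""
    root = Node()
    for items in transactions:
        node = root
        for item in sorted(items):          # assumption: alphabetical item order
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root


transactions = [
    {"Avocado", "Milk", "Butter", "Potatoes"},   # Customer1
    {"Milk", "Diapers", "Avocado", "Beer"},      # Customer2
    {"Beef", "Lemons", "Beer", "Chips"},         # Customer3
    {"Cereal", "Beer", "Beef", "Diapers"},       # Customer4
]
root = build_tree(transactions)
print(sorted(root.children))                 # ['Avocado', 'Beef']
print(sorted(root.children["Avocado"].children))  # ['Beer', 'Butter']
```

Baskets that start with the same items share nodes, which is where the compression of the tree comes from.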
FP-growth Grows the frequent itemsets recursively
FP-growth(FP-tree tree) {
  …
  for-each (item in tree) {
    count = CountOccur(tree, item);
    if (IsFrequent(count)) {
      OutputSet(item);
      sub = tree.GetTree(tree, item);
      FP-growth(sub);
    }
  }
}
22
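The recursion can be made concrete with a runnable sketch. To stay short it works on projected basket lists instead of an actual FP-tree, so it shows the recursive pattern-growth idea rather than the tree mechanics. The function name `fp_growth`, the alphabetical item order and the absolute `min_count` threshold are assumptions made for this illustration.

```python
from collections import Counter


def fp_growth(transactions, min_count, suffix=(), out=None):
    """Return {itemset: count} for every itemset found in at least min_count baskets."""
    if out is None:
        out = {}
    # Count how many baskets contain each item (baskets are sets).
    counts = Counter(item for basket in transactions for item in basket)
    for item in sorted(counts):
        count = counts[item]
        if count < min_count:
            continue
        itemset = suffix + (item,)
        out[frozenset(itemset)] = count
        # Projected data: baskets containing `item`, keeping only items that
        # sort before it, so every itemset is generated exactly once.
        sub = [{i for i in basket if i < item}
               for basket in transactions if item in basket]
        fp_growth(sub, min_count, itemset, out)
    return out


transactions = [
    {"Avocado", "Milk", "Butter", "Potatoes"},
    {"Milk", "Diapers", "Avocado", "Beer"},
    {"Beef", "Lemons", "Beer", "Chips"},
    {"Cereal", "Beer", "Beef", "Diapers"},
]
frequent = fp_growth(transactions, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

With `min_count=2` (50% of four baskets) the sketch finds five single items and the pairs (Avocado, Milk), (Beef, Beer) and (Beer, Diapers).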
FP-growth algorithm · Divide and conquer · Traverse tree [Diagram: the FP-tree built from the four baskets] 23
FP-growth algorithm · Divide and conquer · Generate sub-trees [Diagram: the FP-tree built from the four baskets] 24
FP-growth algorithm · Divide and conquer · Call recursively [Diagram: the FP-tree with a generated sub-tree] 25
FP-growth algorithm Memory usage The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory 26
Distributed Shared Memory? [Diagram: five CPUs, each with its own memory, connected through a network to form shared memory] · To add nodes is to add memory · Works best in tightly coupled setups, with low-latency, high-speed networks 27
FP-growth algorithm Memory usage The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory · Optimize your data structures · Buy more RAM · Get a good idea 28
Get a good idea · Database scans are serial and can be distributed · The list of items used in the recursive calls uniquely determines what part of data we are looking at 29
Get a good idea Avocado Beef Butter Beer Beer Avocado Milk Diapers Chips Cereal Butter Beer Diapers Potatoes Milk Lemon Diapers 30
Get a good idea Avocado Butter, Milk Avocado Butter Beer Diapers Milk Avocado Beer Diapers,Milk These are postfix paths 31
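The observation that the list of items alone determines the data being examined can be sketched as follows. Given only a postfix, any node holding the baskets can re-derive the projected data without receiving a sub-tree. The function name `conditional_data` and the alphabetical ordering rule are assumptions for this illustration, not the original implementation.

```python
def conditional_data(transactions, postfix):
    """Re-derive the data for a recursive call from its postfix alone.

    Keeps the baskets that contain every postfix item and, within them,
    only the items that sort before the smallest postfix item.
    """
    required = set(postfix)
    if not required:
        return [set(basket) for basket in transactions]
    bound = min(required)
    return [{i for i in basket if i < bound}
            for basket in transactions if required <= basket]


transactions = [
    {"Avocado", "Milk", "Butter", "Potatoes"},
    {"Milk", "Diapers", "Avocado", "Beer"},
    {"Beef", "Lemons", "Beer", "Chips"},
    {"Cereal", "Beer", "Beef", "Diapers"},
]
print(conditional_data(transactions, ["Diapers"]))
print(conditional_data(transactions, ["Diapers", "Beer"]))
```

Because the result depends only on the postfix and the stored baskets, a short list of items is enough as a message between nodes.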
Buckets · Use postfix paths for messaging · Working with buckets Transactions Items 33
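One way to picture buckets is as groups of transactions that can be counted independently and then merged. The sketch below assumes that reading of the slide; the round-robin split, the function names and the use of a thread pool as a stand-in for separate compute nodes are all illustrative choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_buckets(transactions, n_buckets):
    """Round-robin split of the baskets into n_buckets lists."""
    buckets = [[] for _ in range(n_buckets)]
    for index, basket in enumerate(transactions):
        buckets[index % n_buckets].append(basket)
    return buckets


def count_bucket(bucket, postfix=frozenset()):
    """Count items in one bucket, using only baskets that contain the postfix."""
    counts = Counter()
    for basket in bucket:
        if postfix <= basket:
            counts.update(basket - postfix)
    return counts


def count_all(buckets, postfix=frozenset()):
    """Count every bucket in parallel and merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda bucket: count_bucket(bucket, postfix), buckets)
        return sum(partials, Counter())


transactions = [
    {"Avocado", "Milk", "Butter", "Potatoes"},
    {"Milk", "Diapers", "Avocado", "Beer"},
    {"Beef", "Lemons", "Beer", "Chips"},
    {"Cereal", "Beer", "Beef", "Diapers"},
]
buckets = make_buckets(transactions, 2)
print(count_all(buckets)["Beer"])                          # 3
print(count_all(buckets, frozenset({"Diapers"}))["Beer"])  # 2
```

The bucket size is a tuning knob: smaller buckets give finer-grained work items at the cost of more messages.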
FP-growth revisited
FP-growth(FP-tree tree) {               // tree: replaced with postfix
  …
  for-each (item in tree) {             // done in parallel
    count = CountOccur(tree, item);     // done in parallel
    if (IsFrequent(count)) {
      OutputSet(item);
      sub = tree.GetTree(tree, item);   // done in parallel
      FP-growth(sub);
    }
  }
}
34
Communication [Diagram: four nodes and the data layer] 35
Revised Communication [Diagram: four nodes, a message queue (MQ) and the data layer] 36
Running FP-growth · Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursively · Standard FP-growth 37
Collecting what we have learned · Message-driven work, using message-queue · Peer-to-peer for intermediate results · Distribute data for scalability (buckets) · Small messages (list of items) · Allows us to distribute FP-growth 39
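The points above can be combined into a single-process sketch of message-driven FP-growth. Each message is a small postfix (a tuple of items), and the loop over buckets stands in for worker nodes counting their own data. This is an illustration under stated assumptions, not the d60 implementation: it uses an in-process `queue.Queue`, alphabetical item order and an absolute `min_count`, and it does not switch to standard FP-growth once the postfix reaches a given size.

```python
from collections import Counter
from queue import Queue


def message_driven_fp_growth(buckets, min_count):
    """Find frequent itemsets by passing postfix messages through a queue."""
    messages = Queue()
    messages.put(())            # first message: the empty postfix
    frequent = {}
    while not messages.empty():
        postfix = messages.get()
        required = set(postfix)
        bound = min(postfix) if postfix else None
        total = Counter()
        for bucket in buckets:  # each iteration stands in for one worker node
            for basket in bucket:
                if required <= basket:
                    # Count only items sorting before the postfix items,
                    # so every itemset is produced exactly once.
                    total.update(i for i in basket
                                 if bound is None or i < bound)
        # Collect the merged counts and send one new message per frequent item.
        for item, count in total.items():
            if count >= min_count:
                new_postfix = postfix + (item,)
                frequent[frozenset(new_postfix)] = count
                messages.put(new_postfix)
    return frequent


buckets = [
    [{"Avocado", "Milk", "Butter", "Potatoes"}, {"Milk", "Diapers", "Avocado", "Beer"}],
    [{"Beef", "Lemons", "Beer", "Chips"}, {"Cereal", "Beer", "Beef", "Diapers"}],
]
result = message_driven_fp_growth(buckets, min_count=2)
print(len(result))                               # 8
print(result[frozenset({"Beer", "Diapers"})])    # 2
```

Since a message carries only a list of items, a failed work item can simply be re-queued, which is one way to read the robustness claim on the next slide.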
Advantages · Configurable work sizes · Good distribution of work · Robust against computer failure · Fast! 40
So what about performance? [Chart: total node time (hh:mm:ss, axis from 00:00:00 to 04:30:00) for message-driven FP-growth and FP-growth, at x-axis values 1, 2, 4 and 8] 41
Thank you! 42