CS535 Big Data, 3/4/2020, Week 7-B, Sangmi Lee Pallickara


  1. CS535 BIG DATA
     PART B. GEAR SESSIONS
     SESSION 2: MACHINE LEARNING FOR BIG DATA
     Sangmi Lee Pallickara
     Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535 (Spring 2020)

     FAQs
     • Lossy Algorithm

     Topics of Today's Class
     • Programming Assignment #2: Lossy Algorithm
     • GEAR Session 2. Machine Learning for Big Data
       • Lecture 2. Distributed Optimization Problem in Machine Learning

     Programming Assignment 2: Lossy Counting Algorithm

     Algorithm
     • Solves the frequent-element problem
     • Divide the incoming stream into buckets of width w = 1/ε
     • Buckets are labeled with integers starting from 1
     • Current bucket number = b_current
       • b_current = ⌈N/w⌉, where N is the number of elements seen so far
     • True frequency of an element e = f_e
     • Data structure: a set D of entries (e, f, Δ)
       • e is an element in the stream
       • f is an integer representing its estimated frequency
       • Δ is the maximum possible error in f
     • Reference: Motwani, R.; Manku, G. S. (2002). "Approximate frequency counts over data streams". VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases: 346–357
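The bucket bookkeeping defined on this page can be sketched in a few lines of Python. This is only an illustration: the names `Entry` and `current_bucket` are hypothetical (they do not come from the slides), and the bucket width `w` is passed as an integer so that the ceiling ⌈N/w⌉ can be computed without floating-point rounding.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One (e, f, delta) triple of the data structure D (hypothetical name)."""
    e: object   # an element in the stream
    f: int      # estimated frequency of e
    delta: int  # maximum possible error in f


def current_bucket(n: int, w: int) -> int:
    """Return b_current = ceil(n / w) for the n-th element (n starts at 1)."""
    return (n + w - 1) // w  # integer ceiling division


# With epsilon = 0.2 the bucket width is w = 1 / 0.2 = 5.
w = 5
bucket_ids = [current_bucket(n, w) for n in range(1, 21)]
print(bucket_ids)
```

With w = 5, elements 1 to 5 fall into bucket 1, elements 6 to 10 into bucket 2, and so on, which is the labeling the slide describes.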

  2. Algorithm (continued)
     • When an element arrives
       • Look up whether an entry for that element already exists
       • If there is an entry, increase its frequency f by one
       • Otherwise, create a new entry of the form (e, f, Δ) = (e, 1, b_current − 1)
     • When the new elements fill up the bucket (N mod w == 0)
       • Prune elements: (e, f, Δ) is deleted if f + Δ ≤ b_current
     • When the user requests a list of items with threshold s
       • The outputs are the items with f ≥ (s − ε)N

     Example (ε = 0.2, w = 1/ε = 5, i.e., 5 items per "bucket")

       bucket 1     bucket 2     bucket 3     bucket 4
       1,2,4,3,4    3,4,5,4,6    7,3,3,6,1    1,3,2,4,7

     Example, 1st bucket
     [Bucket 1] b_current = 1, inserted: 1 2 4 3 4
     Insert phase:
       D (before removing): (x=1;f=1;Δ=0) (x=2;f=1;Δ=0) (x=4;f=2;Δ=0) (x=3;f=1;Δ=0)
     Delete phase: delete elements with f + Δ ≤ b_current (=1)
       D (after removing): (x=4;f=2;Δ=0)
     NOTE: entries with f + Δ ≤ 1 are deleted. Elements newly added in this bucket have a maximum count error of 0.

     Example, 2nd bucket
     [Bucket 2] b_current = 2, inserted: 3 4 5 4 6
     Insert phase:
       D (before removing): (x=4;f=4;Δ=0) (x=3;f=1;Δ=1) (x=5;f=1;Δ=1) (x=6;f=1;Δ=1)
     Delete phase: delete elements with f + Δ ≤ b_current (=2)
       D (after removing): (x=4;f=4;Δ=0)
     NOTE: entries with f + Δ ≤ 2 are deleted. Elements newly added in this bucket have a maximum count error of 1.

     Example, 3rd bucket
     [Bucket 3] b_current = 3, inserted: 7 3 3 6 1
     Insert phase:
       D (before removing): (x=7;f=1;Δ=2) (x=3;f=2;Δ=2) (x=4;f=4;Δ=0) (x=6;f=1;Δ=2) (x=1;f=1;Δ=2)
     Delete phase: delete elements with f + Δ ≤ b_current (=3)
       D (after removing): (x=4;f=4;Δ=0) (x=3;f=2;Δ=2)
     NOTE: entries with f + Δ ≤ 3 are deleted. Elements newly added in this bucket have a maximum count error of 2.

     Example, 4th bucket
     [Bucket 4] b_current = 4, inserted: 1 3 2 4 7
     Insert phase:
       D (before removing): (x=4;f=5;Δ=0) (x=3;f=3;Δ=2) (x=1;f=1;Δ=3) (x=2;f=1;Δ=3) (x=7;f=1;Δ=3)
     Delete phase: delete elements with f + Δ ≤ b_current (=4)
       D (after removing): (x=4;f=5;Δ=0) (x=3;f=3;Δ=2)
     NOTE: entries with f + Δ ≤ 4 are deleted. Elements newly added in this bucket have a maximum count error of 3.

     Example, Output
     D: (x=4;f=5;Δ=0) (x=3;f=3;Δ=2)
     For the threshold s = 0.3 (so far, N = 20):
       (s − ε)N = (0.3 − 0.2) × 20 = 2
     Only two elements are available, and both satisfy f ≥ 2:

       Item    f (estimated)    f (actual)
       4       5                5
       3       3                5

     If s = 0.5? Then (s − ε)N = (0.5 − 0.2) × 20 = 6, and no element will be returned.
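The insert, delete, and output steps above can be replayed with a short Python sketch. It is a minimal illustration rather than the assignment's reference solution: the class name `LossyCounter` and its methods `add` and `query` are hypothetical, D is stored as a dictionary mapping each element to `[f, Δ]`, and new entries start with f = 1 as in the worked example.

```python
import math


class LossyCounter:
    """Minimal lossy counting sketch (hypothetical class, not from the slides)."""

    def __init__(self, epsilon: float):
        self.epsilon = epsilon
        self.w = math.ceil(1 / epsilon)  # bucket width w = 1 / epsilon
        self.n = 0                       # N: number of elements seen so far
        self.d = {}                      # D: element -> [f, delta]

    def add(self, e) -> None:
        self.n += 1
        b_current = (self.n + self.w - 1) // self.w  # ceil(N / w)
        if e in self.d:
            self.d[e][0] += 1                 # existing entry: f += 1
        else:
            self.d[e] = [1, b_current - 1]    # new entry (e, 1, b_current - 1)
        if self.n % self.w == 0:              # bucket boundary: delete phase
            self.d = {x: fd for x, fd in self.d.items()
                      if fd[0] + fd[1] > b_current}

    def query(self, s: float) -> dict:
        """Return {element: f} for entries with f >= (s - epsilon) * N."""
        threshold = (s - self.epsilon) * self.n
        return {x: f for x, (f, _delta) in self.d.items() if f >= threshold}


# The 20-element stream from the example (four buckets of five items).
stream = [1, 2, 4, 3, 4, 3, 4, 5, 4, 6, 7, 3, 3, 6, 1, 1, 3, 2, 4, 7]
counter = LossyCounter(epsilon=0.2)
for item in stream:
    counter.add(item)

print(counter.d)           # {4: [5, 0], 3: [3, 2]}
print(counter.query(0.3))  # {4: 5, 3: 3}
print(counter.query(0.5))  # {}
```

The final state of D and both query results agree with the slide's output table: items 4 and 3 are returned for s = 0.3, and nothing is returned for s = 0.5.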

  3. Why does it work?
     • Lemma 1.
       • At a bucket boundary (the point where a new bucket has most recently started), b_current = ε × N
       • Between boundaries, ε × N is the approximate value of b_current
     • Lemma 2.
       • If an entry (e, f, Δ) is deleted in the delete phase of the algorithm when b_current = k, then the number of occurrences of e (actual count f_e) is less than or equal to k
       • f_e ≤ b_current

     Infrequent Items are NOT included in D
     • Lemma 3.
       • If an item e is not included in D, then f_e ≤ ε × N
       • i.e., the true frequency count of e is less than or equal to ε × N
     • Case 1. Trivial case
       • If e does not appear in the input stream, then trivially the entry (e, f, Δ) was never entered into D, and hence (e, f, Δ) ∉ D
       • We then have f_e = 0, and trivially f_e (= 0) ≤ ε × N is true

     Lemma 3: continued
     • Case 2.
       • If e was in the input stream, and the entry (e, f, Δ) is not in the output set D, then (e, f, Δ) was deleted in some bucket
       [Figure: timeline over Batch 1, Batch 2, Batch 3. Batch 1: e has not appeared; Batch 2: e appears and (e, f, Δ) is deleted; Batch 3: (e, f, Δ) is not present]
       • The maximum actual frequency of e is f + Δ, i.e., f_e ≤ f + Δ
       • According to Lemma 2, because (e, f, Δ) is deleted in bucket b_current, the actual count at that moment satisfies f_e ≤ b_current

     Lemma 3: continued
     • Now, according to Lemma 1, at any bucket boundary b_current = ε × N. Since the entry (e, f, Δ) was deleted at a bucket boundary, at that time (when (e, f, Δ) was deleted):
       f_e ≤ b_current = ε × N
     • Since Lemma 3 is true (if (e, f, Δ) ∉ D when the algorithm terminates, then the actual frequency of item e satisfies f_e ≤ ε × N),
     • by rules of negation (the contrapositive):
       • If the actual frequency of item e satisfies f_e > ε × N, then (e, f, Δ) ∈ D when the algorithm terminates

     Difference between true frequency count and approximate frequency count
     • Lemma 4.
       • If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + ε × N
     • Proof.
       • Part 1. f ≤ f_e
         • The value f (a variable in the algorithm) counts only the occurrences of item e that arrive after the entry (e, f, Δ) has been inserted in D, and the entry may have been deleted before, so it is obvious that f ≤ f_e

     Lemma 4: continued
     • Part 2. f_e ≤ f + ε × N
       [Figure: timeline over Batch 1, Batch 2, Batch 3 with occurrences of e. (e, f, Δ) is deleted in an earlier batch; after it is reinserted, the algorithm keeps an exact count of e during this period]
       • The only occurrences of e that the algorithm fails to count are those that appeared prior to bucket Δ + 1.
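Lemmas 3 and 4 can also be checked empirically. The sketch below is an assumption-laden illustration, not a proof: `lossy_count` and `check_lemmas` are hypothetical helper names, the bucket width `w` is passed as an integer (so ε = 1/w), the stream length is required to be a multiple of `w` so that ε × N equals b_current exactly, and the random test stream and its seed are arbitrary choices.

```python
import random
from collections import Counter


def lossy_count(stream, w):
    """Run lossy counting with bucket width w; return {e: (f, delta)}."""
    d = {}
    for n, e in enumerate(stream, start=1):
        b_current = (n + w - 1) // w          # ceil(n / w)
        if e in d:
            f, delta = d[e]
            d[e] = (f + 1, delta)
        else:
            d[e] = (1, b_current - 1)
        if n % w == 0:                        # delete phase at bucket boundary
            d = {x: (f, dl) for x, (f, dl) in d.items() if f + dl > b_current}
    return d


def check_lemmas(stream, w):
    """Return True if Lemma 3 and Lemma 4 hold for every element seen."""
    n = len(stream)
    assert n % w == 0, "stream must end on a bucket boundary"
    bound = n // w                            # epsilon * N, since epsilon = 1 / w
    true_counts = Counter(stream)             # exact f_e for every element
    d = lossy_count(stream, w)
    for e, f_e in true_counts.items():
        if e in d:
            f, _delta = d[e]
            if not (f <= f_e <= f + bound):   # Lemma 4
                return False
        elif f_e > bound:                     # Lemma 3
            return False
    return True


# The slide's 20-element example with w = 5 (epsilon = 0.2).
example = [1, 2, 4, 3, 4, 3, 4, 5, 4, 6, 7, 3, 3, 6, 1, 1, 3, 2, 4, 7]
print(check_lemmas(example, w=5))

# A skewed random stream of 10,000 elements with w = 100 (epsilon = 0.01).
rng = random.Random(535)
skewed = [min(int(rng.expovariate(0.1)), 200) for _ in range(10_000)]
print(check_lemmas(skewed, w=100))
```

Comparing against `n // w` instead of the floating-point product ε × N keeps the check exact at bucket boundaries.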
