
3.3 Variance and Standard Deviation recap, Anna Karlin (most slides by Alex Tsun)



  1. 3.3 Variance and Standard Deviation recap Anna Karlin Most Slides by Alex Tsun

  2. Agenda ● Variance ● Independence of random variables ● Properties of variance

  3. Variance and Standard Deviation (SD) More Useful
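
For reference, the standard definitions this slide recaps can be written as below. Reading "More Useful" as a label on the second form of the variance is an assumption, since the slide's formulas are not part of the transcript.

```latex
\begin{align*}
\mathrm{Var}(X) &= E\big[(X - E[X])^2\big] = E[X^2] - E[X]^2, \\
\sigma_X &= \sqrt{\mathrm{Var}(X)}.
\end{align*}
```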

  4. Random variables and independence. Random variable X and event E are independent if the event E is independent of the event {X = x} (for any fixed x), i.e. ∀x: P(X = x and E) = P(X = x) · P(E). Two random variables X and Y are independent if the events {X = x} and {Y = y} are independent for any fixed x, y, i.e. ∀x, y: P(X = x and Y = y) = P(X = x) · P(Y = y). Intuition as before: knowing X doesn’t help you guess Y or E, and vice versa.

  5. Independent vs dependent r.v.s ● Dependent r.v.s can reinforce/cancel/correlate in arbitrary ways. ● Independent r.v.s are, well, independent. Example: Z = X_1 + X_2 + … + X_n, where each X_i is an indicator r.v. with probability 1/2 of being 1, versus W = n · X_1.

  6. Important facts about independent random variables. Theorem: If X and Y are independent, then E[X·Y] = E[X]·E[Y]. Theorem: If X and Y are independent, then Var[X + Y] = Var[X] + Var[Y]. Corollary: If X_1, X_2, …, X_n are mutually independent, then Var[X_1 + X_2 + … + X_n] = Var[X_1] + Var[X_2] + … + Var[X_n].
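
The contrast between Z and W from slide 5 can be checked numerically. The sketch below computes both variances exactly for a small n, assuming the X_i in Z are mutually independent fair indicators; the helper names (`variance`, `dist_of_sum`, `dist_of_scaled`) are made up for this illustration.

```python
from fractions import Fraction
from itertools import product


def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    return sum((v - mean) ** 2 * p for v, p in dist.items())


def dist_of_sum(n):
    """Distribution of Z = X_1 + ... + X_n for independent fair indicators."""
    dist = {}
    p = Fraction(1, 2 ** n)  # every outcome of n fair bits is equally likely
    for bits in product((0, 1), repeat=n):
        z = sum(bits)
        dist[z] = dist.get(z, 0) + p
    return dist


def dist_of_scaled(n):
    """Distribution of W = n * X_1: either 0 or n, each with probability 1/2."""
    return {0: Fraction(1, 2), n: Fraction(1, 2)}


n = 4
var_z = variance(dist_of_sum(n))     # additive: n * Var[X_1] = n / 4
var_w = variance(dist_of_scaled(n))  # scaled: n^2 * Var[X_1] = n^2 / 4
print(var_z, var_w)
```

Both Z and W have the same mean n/2, but the variance of W grows like n² while the variance of Z grows only like n, which is what the corollary predicts for a sum of independent terms.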

  7. E[XY] for independent random variables (products of independent r.v.s) ● Theorem: If X and Y are independent, then E[X·Y] = E[X]·E[Y]. ● Proof: follows from independence, which lets the joint probability P(X = x and Y = y) factor as P(X = x) · P(Y = y). Note: NOT true in general; see earlier example E[X²] ≠ E[X]².
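
The proof steps are not part of the transcript. A standard derivation for discrete random variables, which may differ in presentation from the slide, is:

```latex
\begin{align*}
E[XY] &= \sum_x \sum_y x y \, P(X = x \text{ and } Y = y) \\
      &= \sum_x \sum_y x y \, P(X = x) \, P(Y = y) && \text{(independence)} \\
      &= \Big( \sum_x x \, P(X = x) \Big) \Big( \sum_y y \, P(Y = y) \Big) \\
      &= E[X] \, E[Y].
\end{align*}
```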

  8. Variance of a sum of independent r.v.s (variance of independent r.v.s is additive). Theorem (Bienaymé, 1853): If X and Y are independent, then Var[X + Y] = Var[X] + Var[Y]. Proof: expand Var[X + Y] using linearity of expectation, then apply E[X·Y] = E[X]·E[Y].
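
The proof steps are not part of the transcript. One standard way to carry out the argument, using Var[X] = E[X²] − E[X]² and the product rule for independent r.v.s, is:

```latex
\begin{align*}
\mathrm{Var}[X + Y] &= E[(X + Y)^2] - (E[X + Y])^2 \\
  &= E[X^2] + 2E[XY] + E[Y^2] - \big( E[X]^2 + 2E[X]E[Y] + E[Y]^2 \big) \\
  &= \big( E[X^2] - E[X]^2 \big) + \big( E[Y^2] - E[Y]^2 \big) + 2\big( E[XY] - E[X]E[Y] \big) \\
  &= \mathrm{Var}[X] + \mathrm{Var}[Y] && \text{(since } E[XY] = E[X]E[Y] \text{)}.
\end{align*}
```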

  9. Probability Alex Tsun Joshua Fan

  10. Bloom Filters Anna Karlin Most slides by Shreya Jayaraman, Luxi Wang, Alex Tsun

  11. Hashing

  12. Basic Problem. Problem: Store a subset S of a large set U. Example: U = set of 128-bit strings, |U| ≈ 2^128; S = subset of strings of interest, |S| ≈ 1000. Two goals: 1. Constant-time answering of queries “Is x ∈ S?” 2. Minimize storage requirements.

  13. Naïve Solution – Constant Time. Idea: Represent S as an array A with 2^128 entries, where A[x] = 1 if x ∈ S and A[x] = 0 if x ∉ S. Example: S = {0, 2, …, K} (figure: bit array A indexed 0, 1, 2, …, K, …). Membership test: To check x ∈ S, just check whether A[x] = 1. 👍 😀 → constant time! Storage: Requires storing 2^128 bits, even for small S. 👎 😢

  14. Naïve Solution – Small Storage. Idea: Represent S as a list with |S| entries. Example: S = {0, 2, …, K} is stored as the list 0, 2, …, K. Storage: Grows with |S| only. 👍 😀 Membership test: Checking x ∈ S requires time linear in |S| (can be made logarithmic by using a tree). 👎 😢

  15. Hash Table. Idea: Map elements in S into an array A using a hash function h: U → [n]. Membership test: To check x ∈ S, just check whether A[h(x)] = x. Storage: n elements. (Figure: elements 1, 2, …, K mapped by h into array slots.)

  16. Hash Table. Idea: Map elements in S into an array A using a hash function. Membership test: To check x ∈ S, just check whether A[h(x)] = x. Storage: n elements. Challenge 1: Ensure h(x) ≠ h(y) for most x, y ∈ S. Challenge 2: Ensure n = O(|S|).
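
To see why Challenge 1 matters, here is a minimal sketch of the hash-table idea with no collision handling at all. The toy hash function (reduction mod n) and the helper names are assumptions chosen for illustration, not something the slides specify.

```python
def build_table(elements, n, h):
    """Place each element at slot h(x); return the table and any elements that collided."""
    table = [None] * n
    collisions = []
    for x in elements:
        slot = h(x)
        if table[slot] is None:
            table[slot] = x
        else:
            collisions.append(x)  # slot already holds a different element
    return table, collisions


def contains(table, h, x):
    """Membership test from the slide: check whether the slot h(x) holds x itself."""
    return table[h(x)] == x


n = 10


def h(x):
    # Hypothetical toy hash function: integers reduced mod n.
    return x % n


table, collisions = build_table([3, 7, 13], n, h)
print(collisions)              # 13 lands on the same slot as 3
print(contains(table, h, 7))   # stored, so the test succeeds
print(contains(table, h, 13))  # lost to the collision, so the test fails
```

With this naive scheme a colliding element is simply lost, so a hash function that spreads the stored elements across distinct slots, or an explicit collision strategy, is needed.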

  17. Hashing – collisions ● Collisions occur when two elements of the set map to the same location in the hash table. ● Common solution: chaining – at each location (bucket) in the table, keep a linked list of all elements that hash there. ● Want: a hash function that distributes the elements of S well across hash table locations. Ideally a uniform distribution!
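
Chaining can be sketched in a few lines. This is an illustrative version, not the slides' implementation: it uses Python lists as buckets in place of linked lists and Python's built-in `hash` as a stand-in hash function, and the class name `ChainedHashTable` is made up.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a bucket per slot."""

    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one (initially empty) chain per slot

    def _slot(self, x):
        # Stand-in hash function: Python's built-in hash reduced mod n.
        return hash(x) % self.n

    def add(self, x):
        bucket = self.buckets[self._slot(x)]
        if x not in bucket:  # avoid storing duplicates
            bucket.append(x)

    def contains(self, x):
        # Only the one bucket x hashes to has to be scanned.
        return x in self.buckets[self._slot(x)]


table = ChainedHashTable(n=4)
for url in ["a.com", "b.com", "c.com", "d.com", "e.com"]:
    table.add(url)

print(table.contains("c.com"))
print(table.contains("zzz.com"))
```

Five elements in four buckets force at least one collision, yet every stored element is still found. Lookup cost is the length of the bucket scanned, which is why a well-spread hash function matters.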

  18. Summary Hash Tables ● They store the data itself ● With a good hash function, the data is well distributed in the table and lookup times are small. ● However, they need at least as much space as all the data being stored ● E.g. storing strings, or IP addresses or long DNA sequences.

  19. Bloom Filters: Motivation ● Large universe of possible data items. ● Data items are large (say 128 bits or more) ● Hash table is stored on disk or across network, so any lookup is expensive. ● Many (if not nearly all) of the lookups return “Not found”. Altogether, this is bad. You’re wasting a lot of time and space doing lookups for items that aren’t even present.

  20. Bloom Filters: Motivation ● Large universe of possible data items. ● Hash table is stored on disk or in network, so any lookup is expensive. ● Many (if not most) of the lookups return “Not found”. Altogether, this is bad. You’re wasting a lot of time and space doing lookups for items that aren’t even present. Examples: ● Google Chrome: wants to warn you if you’re trying to access a malicious URL. Keep hash table of malicious URLs. ● Network routers: want to track source IP addresses of certain packets, e.g., blocked IP addresses.

  21. Bloom Filters: Motivation ● Probabilistic data structure. ● Close cousins of hash tables. ● Ridiculously space efficient ● To get that, make occasional errors, specifically false positives. Typical implementation: only 8 bits per element!

  22. Bloom Filters

  23. Bloom Filters ● Stores information about a set of elements. ● Supports two operations: 1. add(x) - adds x to the bloom filter. 2. contains(x) - returns true if x is in the bloom filter, otherwise returns false. a. If it returns false, x is definitely not in the bloom filter. b. If it returns true, x is possibly in the structure (some false positives).

  24. Bloom Filters ● Why accept false positives? ○ Speed – both operations are very fast. ○ Space – requires a minuscule amount of space relative to storing all the actual items that have been added. ○ Often just 8 bits per inserted item!

  25. Bloom Filters: Initialization. k: number of hash functions. m: size of the array associated with each hash function. For each hash function, initialize an empty bit vector of size m.

  26. Bloom Filters: Example. Bloom filter t with m = 5 that uses k = 3 hash functions.

      Index → 0 1 2 3 4
      t_1     0 0 0 0 0
      t_2     0 0 0 0 0
      t_3     0 0 0 0 0

  27. Bloom Filters: Add. For each hash function h_i: h_i(x) → result of hash function h_i on x.

  28. Bloom Filters: Add. For each hash function h_i: index into the i-th bit vector at the index produced by the hash function, and set that bit to 1.

  29. Bloom Filters: Example. Bloom filter t with m = 5 that uses k = 3 hash functions. add(“thisisavirus.com”)

      Index → 0 1 2 3 4
      t_1     0 0 0 0 0
      t_2     0 0 0 0 0
      t_3     0 0 0 0 0

  30. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. add(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 0 0 0 0
      t_3     0 0 0 0 0

  31. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. add(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2
      h_2(“thisisavirus.com”) → 1

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 0

  32. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. add(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2
      h_2(“thisisavirus.com”) → 1
      h_3(“thisisavirus.com”) → 4

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1

  33. Bloom Filters: Contains Returns True if the bit vector for each hash function has bit 1 at index determined by that hash function, otherwise returns False
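
The initialization, add, and contains steps described on these slides can be put together as a minimal sketch. It follows the slides' layout of k separate bit vectors of size m. Deriving the k hash functions from SHA-256 with the index i as a salt is an assumption made for illustration; the slides do not say which hash functions to use.

```python
import hashlib


class BloomFilter:
    """Bloom filter with k hash functions, each owning its own bit vector of size m."""

    def __init__(self, k, m):
        self.k = k
        self.m = m
        # One empty bit vector of size m per hash function.
        self.t = [[0] * m for _ in range(k)]

    def _index(self, i, x):
        # Assumed hash family: h_i(x) = SHA-256("i:x") mod m.
        digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.m

    def add(self, x):
        # Set the bit chosen by h_i in the i-th bit vector, for every i.
        for i in range(self.k):
            self.t[i][self._index(i, x)] = 1

    def contains(self, x):
        # True only if every bit vector has a 1 at the index its hash function picks.
        return all(self.t[i][self._index(i, x)] == 1 for i in range(self.k))


bf = BloomFilter(k=3, m=1000)
bf.add("thisisavirus.com")
print(bf.contains("thisisavirus.com"))  # always True for an added item
```

An item that was added always tests True, so there are no false negatives. An item that was never added can still test True if other insertions happen to have set all k of its bits.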

  34. Bloom Filters: Example. Bloom filter t with m = 5 that uses k = 3 hash functions. contains(“thisisavirus.com”)

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1

  35. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. contains(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2   True

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1

  36. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. contains(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2   True
      h_2(“thisisavirus.com”) → 1   True

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1

  37. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. contains(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2   True
      h_2(“thisisavirus.com”) → 1   True
      h_3(“thisisavirus.com”) → 4   True

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1

  38. Bloom Filters: Example. Bloom filter t of length m = 5 that uses k = 3 hash functions. contains(“thisisavirus.com”)
      h_1(“thisisavirus.com”) → 2   True
      h_2(“thisisavirus.com”) → 1   True
      h_3(“thisisavirus.com”) → 4   True
      Since all conditions are satisfied, returns True (correctly).

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1
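
The walkthrough on slides 29 to 38 can be replayed in code. In this sketch the three hash functions are replaced by a lookup table holding the outputs shown on the slides (2, 1, 4); the table `HASHES` is a stand-in for illustration, not a real hash function.

```python
M, K = 5, 3

# Stand-in for h_1, h_2, h_3: the outputs shown on the slides.
HASHES = {"thisisavirus.com": [2, 1, 4]}

# One bit vector of size M per hash function, all zeros initially.
t = [[0] * M for _ in range(K)]


def add(x):
    for i, index in enumerate(HASHES[x]):
        t[i][index] = 1


def contains(x):
    return all(t[i][index] == 1 for i, index in enumerate(HASHES[x]))


add("thisisavirus.com")
for row in t:
    print(row)
print(contains("thisisavirus.com"))
```

After the add, the three rows match the bit vectors t_1, t_2, t_3 on slide 32, and the contains call succeeds because all three checked bits are 1.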

  39. Bloom Filters: False Positives. Bloom filter t of length m = 5 that uses k = 3 hash functions. add(“totallynotsuspicious.com”)

      Index → 0 1 2 3 4
      t_1     0 0 1 0 0
      t_2     0 1 0 0 0
      t_3     0 0 0 0 1
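
How a false positive can arise is easiest to see with concrete numbers. The transcript ends before the rest of this example, so the hash outputs for “totallynotsuspicious.com” and the query “verynormalsite.com” below are hypothetical values chosen for illustration; only the outputs for “thisisavirus.com” come from the slides.

```python
M, K = 5, 3

HASHES = {
    "thisisavirus.com": [2, 1, 4],          # from the slides
    "totallynotsuspicious.com": [1, 0, 4],  # hypothetical values
    "verynormalsite.com": [2, 0, 4],        # hypothetical values; never added
}

t = [[0] * M for _ in range(K)]


def add(x):
    for i, index in enumerate(HASHES[x]):
        t[i][index] = 1


def contains(x):
    return all(t[i][index] == 1 for i, index in enumerate(HASHES[x]))


add("thisisavirus.com")
add("totallynotsuspicious.com")

# Each bit this query checks was set by one of the two added items.
false_positive = contains("verynormalsite.com")
print(false_positive)
```

The query was never added, yet it tests True: its bit in t_1 was set by the first item, its bit in t_2 by the second, and its bit in t_3 by both. A False answer, by contrast, can always be trusted.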
