

1. Probabilistic Counting: from analysis to algorithms to programs
Philippe Flajolet, INRIA, Rocquencourt
http://algo.inria.fr/flajolet

2. Given a (large) sequence s over some (large) domain D, s = s_1 s_2 ··· s_ℓ, s_j ∈ D. View the sequence s as a multiset M = m_1^{f_1} m_2^{f_2} ··· m_n^{f_n}.
— A. Length := ℓ;
— B. Cardinality := card{s_j} ≡ n;
— C. Mice := # elements repeated 1, 2, ..., 10 times;
— D. Icebergs := # elements with relative frequency f_v/ℓ > 1/100;
— E. Elephants := # elements with absolute frequency f_v > 200;
— F. Frequency moments := (Σ_v f_v^r)^{1/r}.
Alon, Matias, Szegedy; Bar-Yossef; Indyk; Motwani; RAP@Inria...
Flajolet–Martin (1985); Flajolet (1992); Louchard (1997); Durand–Flajolet (2003); FlFuGaMe ❀ AofA07; Prodinger; Fill–Janson–Mahmoud–Szpankowski...
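The six characteristics A–F can be pinned down by a brute-force reference computation; the point of the talk is that the streaming algorithms below *estimate* them without ever building the full frequency table. A minimal Python sketch (function and parameter names are mine, with the slide's thresholds as defaults):

```python
from collections import Counter

def stream_stats(s, mouse_max=10, iceberg_ratio=1 / 100, elephant_abs=200, r=2):
    """Exact (memory-hungry) computation of characteristics A-F of a
    sequence s, via the full multiset view m1^f1 m2^f2 ... mn^fn."""
    ell = len(s)                                                      # A. Length
    freq = Counter(s)                                                 # full frequency table
    n = len(freq)                                                     # B. Cardinality
    mice = sum(1 for f in freq.values() if 1 <= f <= mouse_max)       # C. Mice
    icebergs = sum(1 for f in freq.values() if f / ell > iceberg_ratio)  # D. Icebergs
    elephants = sum(1 for f in freq.values() if f > elephant_abs)     # E. Elephants
    moment = sum(f ** r for f in freq.values()) ** (1 / r)            # F. Freq. moment
    return ell, n, mice, icebergs, elephants, moment
```

For s = "abracadabra" this gives ℓ = 11 and n = 5 distinct letters; the slide's estimators recover such figures in a few kilobytes instead of a full table.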

3. s = s_1 s_2 ··· s_ℓ, s_j ∈ D. The length can be ℓ ≫ 10^9. The cardinality can be n ∝ 10^7. Routers operate in the range of Terabits/sec (10^12 b/s). Google indexes 6 billion pages and prepares to index 100 Petabytes of data (10^17 B). We can estimate a few key characteristics, QUICK and EASY.

4. Length; Cardinality; Icebergs; Mice; Elephants; Frequency moments...
Rules of the game:
• Limited storage: cannot store the elements; use ≈ one page of print ≡ 4 kB.
• Limited time: proceed online = a single pass, reading the data once.
• Allowed to estimate rather than compute exactly.
Assume a hash function h : D → [0, 1] scrambles the data uniformly. Angel–daemon scenario: n values, replicated and permuted at will, then made into random uniform [0, 1] draws.

5. What for?
— Network management, worms and viruses, traffic monitoring.
— Databases: query optimization = size estimation; also "sketches".
— Document classification (Broder), cf. Google, CiteSeer, ...
— Data mining of the web graph, internet graph, etc.
Traces of attacks: number of active connections in time slices. Incoming/outgoing flows at 40 Gbit/s. Code Red worm: 0.5 GB of compressed data per hour (2001). CISCO: in 11 minutes, a worm infected 500,000,000 machines.
(Figures. Left: raw ADSL traffic, FT@Lyon, 1.5 × 10^8 packets [21h–23h]. Right: an attack; [Estan–Varghese–Fisk], distinct incoming/outgoing connections.)

6. Claims:
— High-tech algorithms based on probabilities.
— Efficient programs: short algorithms and programs with O(10) instructions. Gains by factors in the range 100–1000 (!)
— No maths, no algorithms!
AofA: symbolic methods and generating functions, complex asymptotics (singularities, saddle points), limit laws and quasi-powers, transforms (Mellin), analytic depoissonization... Constants play a crucial rôle.

7. 1 APPROXIMATE COUNTING
In the streaming framework: given s_1 s_2 ··· s_ℓ, get the length ℓ. Means: maintain an efficient counter of events. The oldest algorithm [Morris, CACM 1977]: counting a large number of events in small memory. First analysis [Flajolet 1985]; Prodinger [1992–94].

8. Approximate Counting
• Information theory: need log_2 N bits to count till N.
• Approximate counting: use log_2 log N + O(1) bits for an ε-approximation, in relative terms and in probability.
How to find an unbounded integer while posing few questions?
— Ask whether it is in [1–2], [2–4], [4–8], [8–16], etc.
— Conclude by binary search (total cost is ≈ 2 log_2 n).
= A general paradigm for unbounded search:
• Ethernet proceeds by period doubling + randomization.
• Wake-up procedures for mobile communication [Lavault+].
• Adaptive data structures: e.g., extendible hashing tables.
♥ Approximate Counting
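The doubling-then-bisect paradigm is easy to make concrete; a minimal sketch (names are mine, not from the talk) that finds an unknown threshold with about 2 log_2 n monotone questions:

```python
def unbounded_search(pred):
    """Least n >= 1 with pred(n) True, for a monotone predicate
    (False ... False True True ...), with no a-priori upper bound.
    Phase 1: probe 1, 2, 4, 8, ... until pred holds (~log2 n questions).
    Phase 2: binary search inside the bracketing interval (another ~log2 n),
    for a total near 2 log2 n, as on the slide."""
    hi = 1
    while not pred(hi):          # doubling: intervals [1-2], [2-4], [4-8], ...
        hi *= 2
    lo = hi // 2                 # pred is False at lo (or lo == 0)
    while lo + 1 < hi:           # classic binary search on (lo, hi]
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

E.g. `unbounded_search(lambda n: n >= 1000)` asks about 20 questions instead of 1000 linear probes.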

9. Emulate a counter subject to X := X + 1.
(Figure: Markov chain on counter values C = 1, 2, 3, 4, 5, ... with transition probabilities 1/2, 1/4, 1/8, 1/16, 1/32, ...)
Algorithm: Approximate Counting /* binary base */
— Initialize: C := 1;
— Increment: do C := C + 1 with probability 2^{−C};
— Output: 2^C − 2.
An alternate base q → 1 controls the cost/accuracy tradeoff.
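The three lines of the slide are directly executable; a sketch of the binary version (class name is mine):

```python
import random

class ApproxCounter:
    """Morris's approximate counter, binary base, as on the slide:
    initialize C := 1; on each event do C := C + 1 with probability 2^-C;
    estimate the count as 2^C - 2 (unbiased, see the next slide)."""

    def __init__(self, rng=None):
        self.C = 1
        self.rng = rng or random.Random()

    def increment(self):
        # the counter only advances with probability 2^-C, so after n
        # events C sits near log2 n and needs ~log2 log n bits of state
        if self.rng.random() < 2.0 ** (-self.C):
            self.C += 1

    def estimate(self):
        return 2 ** self.C - 2
```

After a thousand `increment()` calls the state C is a number near 10, i.e. about 4 bits in place of the 10 an exact counter would need.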

10. Expect C near log_2 n after n steps, then use only log_2 log n bits.
(Figure: 10 runs of APCO, value of C, n = 10^3.)
Theorem:
• The basic binary algorithm is unbiased: E_n(2^C − 2) = n.
• Accuracy, i.e., standard error ≡ std-dev/n, is ∼ 1/√2.
• The asymptotics of the distribution is (binary case):
P(C = ℓ) ∼ Φ(n/2^ℓ), Φ(x) := (1/Q_∞) Σ_{k≥0} (−1)^k q^{k(k−1)/2} e^{−x q^{−k}} / Q_k,
where Q_k := (1 − q)(1 − q^2) ··· (1 − q^k) and q = 1/2 in the binary case.
Count till N using log_2 log N + δ bits, with accuracy ∼ 0.59 · 2^{−δ/2}. Beats information theory: 8 bits for counts ≤ 2^16 with accuracy ≈ 15%.
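The unbiasedness claim E_n(2^C − 2) = n is easy to check by simulation; a sketch (seeded for reproducibility): averaging many runs recovers n, while any single run is only accurate up to the ∼ 1/√2 standard error.

```python
import random

def one_run(n, rng):
    """One run of the binary approximate counter over n events."""
    C = 1
    for _ in range(n):
        if rng.random() < 2.0 ** (-C):
            C += 1
    return 2 ** C - 2

rng = random.Random(2007)
n, runs = 1000, 4000
estimates = [one_run(n, rng) for _ in range(runs)]
mean = sum(estimates) / runs
rel_std = (sum((e - mean) ** 2 for e in estimates) / runs) ** 0.5 / n
# mean hugs n = 1000, while rel_std sits near 1/sqrt(2) ~ 0.71
```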

11. Recurrence: P_{n+1,ℓ} = (1 − q^ℓ) P_{n,ℓ} + q^{ℓ−1} P_{n,ℓ−1}.
E_n(2^C) = n + 2, V(2^C) = n(n + 1)/2 [Morris 1977].
Symbolic methodology: (i) describe events; (ii) translate to generating functions (GFs). Take an alphabet A with weights for Bernoulli trials. For a language describing an event E, the GF is
E(z) ≡ Σ_n E_n z^n = Σ_n P_n(E) z^n.
Translation rules: a ∈ A ↦ αz; E ⊎ F ↦ E(z) + F(z); E ⊙ F ↦ E(z) × F(z); E⋆ ↦ (1 − E(z))^{−1}, since 1/(1 − f) = 1 + f + f^2 + ··· ≃ (f)⋆.
(Figure: the counter as a chain automaton with loops a_1, a_2, a_3 and forward transitions b_1, b_2, b_3.) Example: a_1⋆ · b_1 · a_2⋆ · b_2 · a_3⋆ · b_3.
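The recurrence is immediately machine-checkable; a sketch that iterates it exactly (in floating point) and confirms Morris's moments E_n(2^C) = n + 2 and V(2^C) = n(n + 1)/2:

```python
def counter_distribution(n, q=0.5, levels=64):
    """Iterate the slide's recurrence
        P[n+1, l] = (1 - q^l) P[n, l] + q^(l-1) P[n, l-1]
    from the initial state C = 1; returns the row (P[n, l]), l = 0..levels."""
    P = [0.0] * (levels + 1)
    P[1] = 1.0
    for _ in range(n):
        # the comprehension reads the old row P before reassignment
        P = [0.0] + [(1 - q ** l) * P[l] + q ** (l - 1) * P[l - 1]
                     for l in range(1, levels + 1)]
    return P
```

With n = 1000 the first two moments of 2^C land exactly on n + 2 = 1002 and n(n + 1)/2 = 500500, up to float rounding.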

12. The chain a_1⋆ b_1 a_2⋆ b_2 a_3⋆ translates to 1/(1 − a_1) · b_1 · 1/(1 − a_2) · b_2 · 1/(1 − a_3).
• Perform the probabilistic valuation a_j ↦ (1 − q^j) z, b_j ↦ q^j z:
H_3(z) = q^{1+2} z^2 / ((1 − (1 − q)z)(1 − (1 − q^2)z)(1 − (1 − q^3)z)).
• Do a partial fraction expansion to get exact probabilities.
• Use (1 − a)^n ≈ e^{−na} to get the main approximation:
P(C = ℓ) ∼ Φ(n/2^ℓ), Φ(x) := (1/Q_∞) Σ_{k≥0} (−1)^k q^{k(k−1)/2} e^{−x q^{−k}} / Q_k,
where Q_k := (1 − q)(1 − q^2) ··· (1 − q^k), and q = 1/2 in the binary case.
Cf. Flajolet & Sedgewick, Analytic Combinatorics, C.U.P., 2007.

13. ♣ Dyadic superposition of models: P_n(C = ℓ) ∼ Φ(n/2^ℓ).
Mean: E_n(C) ∼ Σ_ℓ ℓ Φ(n/2^ℓ).
(Figure: E(C) − log_2 n for n = 200..1000; it oscillates around ≈ −0.27395 with amplitude near 10^{−5}.)
Real analysis is possible: Knuth 1965, Guibas 1977+, Fill–Mahmoud–Szpankowski–Janson, Robert–Mohamed, ...
• Complex asymptotic methodology: the Mellin transform [FlDuGo95, FlSe*],
f⋆(s) := ∫_0^∞ f(x) x^{s−1} dx.
Needs singularities in the complex plane.
Mellin: probabilistic counting, loglog counting + Lempel–Ziv compression [Jacquet–Szpankowski] + dynamic hashing + tree protocols [Jacquet+] + quadtries, &c.

14. Mellin transform f⋆(s) = ∫_0^∞ f(x) x^{s−1} dx, from real to complex.
♥ Maps the asymptotics of f at 0 and +∞ to the singularities of f⋆ in ℂ:
C · x^α ↔ C/(s + α) (a pole). Reason: the inversion theorem,
f(x) = (1/2iπ) ∫_{c−i∞}^{c+i∞} f⋆(s) x^{−s} ds = Σ Residues + smaller terms.
♥ Factorizes harmonic sums:
Σ λ f(μx) ↦ f⋆(s) · Σ λ μ^{−s}.
For dyadic sums: Σ_k f(x 2^{−k}) ↦ f⋆(s)/(1 − 2^s) ⟹ poles at α = 2ikπ/log 2 ⟹ fluctuating terms x^{−α} = e^{−2ikπ log_2 x}.
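A concrete harmonic sum shows the mechanism numerically; a sketch with f(x) = 1 − e^{−x} (my choice of test function, not from the talk): the dyadic sum grows like log_2 x, and the remainder is asymptotically periodic in log_2 x, exactly as the poles at s = 2ikπ/log 2 predict.

```python
import math

def G(x, terms=400):
    """Dyadic harmonic sum G(x) = sum_{k>=0} f(x / 2^k), f(x) = 1 - e^-x.
    Its Mellin transform carries the factor 1/(1 - 2^s), whose poles on
    the imaginary axis produce fluctuations periodic in log2 x."""
    return sum(1.0 - math.exp(-x / 2.0 ** k) for k in range(terms))

# Exact functional equation: G(2x) = G(x) + 1 - e^{-2x}.  So doubling x
# adds one unit up to an exponentially small term, and the remainder
# G(x) - log2(x) is periodic in log2 x at infinity.
```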

15. Cultural flashes
— Complexity: Morris [1977], counting a large number of events in small memory; the power of probabilistic machines and approximation [Freivalds 1977].
— Special functions: Mellin analysis involves partition identities for Dirichlet series. Prodinger has connections with q-hypergeometric functions, e.g. an identity of the shape
Σ_{n≥0} q^{n(n+1)/2} x^n w^n / ((1 − w)(1 − qw) ··· (1 − q^{n−1}w)) = Σ_{n≥0} (−qx)^n / ((1 + xq) ··· (1 + xq^{n+1})).
— Probability theory: exponentials of Poisson processes [Yor et al.]: Σ_i E_i q^i, where E_i ∼ Exp(1).
— Communication: the TCP protocol = Additive Increase Multiplicative Decrease (AIMD) leads to similar functions [Robert et al., 2001]. Ethernet: get the waiting time for a packet subject to k collisions [Robert]. Ethernet is unstable [Aldous 1986] but tree protocols are stable [Jacquet+].

16. 2 CARDINALITY ESTIMATORS
Given a stream (a read-once sequence), estimate the number of distinct elements.
— Adaptive Sampling
— Probabilistic Counting
— LogLog Counting

17. 2.1 Adaptive Sampling
• An algorithm of M. Wegman [1980+] that does cardinality estimation for s = s_1 ... s_ℓ, and more: it samples uniformly over the domain (the set) of a multiset = of independent interest for databases.
• ≠ straight sampling (by positions); cf. Vitter [TOMS 1985], Devroye 1986, ...
First analysis [Flajolet 1992]; Louchard [2000].

18. Databases: given ⟨persons, towns⟩, get geography from demography?
(Figure: map obtained by Adaptive Sampling vs. plain Sampling; © Bettina Speckmann, TU Eindhoven.)

19. Sample values (i.e., without multiplicity)?
Algorithm: Adaptive Sampling (without multiplicities)
/* Get a sample of size ≤ m according to distinct values. */
— Keep the elements whose hashed value begins with d zeros, h(x) = 0...0...; the sampling depth is d = 0, 1, 2, ...
— On overflow of the sample of size ≤ m: increase the sampling depth and decrease the sampling rate = use farther bits to filter.
(Figure: a bucket holding the current sample, filtered first by h(x) = 0..., then by h(x) = 00...)
The analysis makes use of digital trees, generating functions and Mellin transforms.
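The slide's sketch can be turned into a few lines of Python; this is an illustrative reconstruction of Wegman's scheme, not Flajolet's code, and the dict-based `h` merely simulates the ideal uniform hash of slide 4:

```python
import random

def adaptive_sample_cardinality(stream, m=64, seed=7):
    """Sketch of Wegman's adaptive sampling.  Keep the distinct values
    whose hash falls below 2^-d ("starts with d zero bits"); on overflow
    of the m-slot sample, increase the depth d and re-filter.  Then
    |sample| * 2^d estimates the cardinality n, with relative accuracy
    of order 1/sqrt(m)."""
    rng = random.Random(seed)
    hvals = {}                          # simulated ideal hash h : D -> [0, 1)

    def h(v):
        if v not in hvals:
            hvals[v] = rng.random()
        return hvals[v]

    d = 0
    sample = set()
    for v in stream:
        if h(v) < 2.0 ** (-d):          # value passes the current filter
            sample.add(v)
            while len(sample) > m:      # overflow: deepen, halve the rate
                d += 1
                sample = {u for u in sample if h(u) < 2.0 ** (-d)}
    return len(sample) * 2 ** d
```

Duplicates cannot perturb the sample (a repeated value hashes to the same place), which is exactly why this samples distinct *values* rather than positions; when n ≤ m the answer is even exact.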
