Séminaire de Probabilités, Paris, June 2010. The Digital Tree: Analysis and Applications. Philippe Flajolet, INRIA Rocquencourt. Tuesday, June 22, 2010.
A (finite) tree associated with a (finite) set of words over an alphabet A. Equipped with a randomness model on words, we get a random tree, indexed by the number n of words. Goal: characterize its probabilistic properties, mostly with COMPLEX ANALYSIS.
1. Digital Trees & Algorithms
Infinite tree: set of words <--> partial tree; word <--> branch.
DIGITAL TREE aka “TRIE” := STOP descent by pruning long one-way branches.
~ Only places corresponding to 2+ words (and their immediate descendants) are kept.
~ The digital tree is finite as soon as it is built out of distinct words.
E = {a..., bba..., bbb...}
TOP-DOWN construction: the set E is separated into E_a, ..., E_z according to initial letter; continue with next letter...
INCREMENTAL construction: start with the empty tree and insert elements of E one after the other... (Split leaves as the need arises.)
E = {a..., bba..., bbb...}
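The incremental construction can be sketched as follows. This is a minimal illustration, not the speaker's code: the node representation (None for an empty place, a string for a leaf, a dict for an internal node) and the names `insert`, `build` and `count_internal` are assumptions made here, and the infinite words of the slide are replaced by finite strings such that no word is a prefix of another.

```python
def insert(node, word, depth=0):
    """Insert `word` below `node` and return the (possibly new) subtree root.

    Assumed representation: None = empty place, str = leaf holding one word,
    dict = internal node mapping the letter at position `depth` to a subtrie.
    """
    if node is None:
        return word                      # empty place: the word becomes a leaf
    if isinstance(node, str):
        if node == word:
            return node                  # word already present
        # Split the leaf: push both words below a fresh internal node.
        fresh = insert({}, node, depth)
        return insert(fresh, word, depth)
    letter = word[depth]
    node[letter] = insert(node.get(letter), word, depth + 1)
    return node


def build(words):
    """Incremental construction: start empty, insert the words one by one."""
    root = None
    for w in words:
        root = insert(root, w)
    return root


def count_internal(node):
    """Number of internal nodes (the 'size' of the trie)."""
    if not isinstance(node, dict):
        return 0
    return 1 + sum(count_internal(child) for child in node.values())


# Finite stand-ins for E = {a..., bba..., bbb...}
trie = build(["aaaa", "bbaa", "bbba"])
print(trie, count_internal(trie))
```

The resulting shape does not depend on the insertion order, which is why the top-down and incremental descriptions define the same tree.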
SUMMARY of source models: Memoryless (Bernoulli) p, q; Markov; CF (continued fractions).
Algorithms: 1 - Dictionaries
Manage dictionaries dynamically; hope for O(log n) depth?
Save space by “factoring” common prefixes; hope for O(n) size?
However, the worst case is unbounded...
“TRIE” = tree + retrieval (Fredkin, de la Briandais ~1960)
Analysis?
A random trie on n = 500 uniform binary sequences; size = 741 internal nodes; height = 18.
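An experiment of this kind can be reproduced with a short simulation. The sketch below is an assumption-laden illustration: the helper names `random_words` and `trie_stats`, the 64-bit truncation of the "infinite" sequences, and the convention that height counts edges on the longest root-to-leaf path are all choices made here. The numbers it prints vary with the seed and need not match the figures on the slide.

```python
import random


def random_words(n, length=64, seed=None):
    """n distinct uniform binary words, truncated to `length` bits (assumption)."""
    rng = random.Random(seed)
    words = set()
    while len(words) < n:
        words.add(format(rng.getrandbits(length), f"0{length}b"))
    return sorted(words)


def trie_stats(words, depth=0):
    """Return (internal_nodes, height) of the binary trie built on `words`.

    Top-down construction: split on the bit at position `depth`,
    stop as soon as at most one word remains.
    """
    if len(words) <= 1:
        return 0, 0                      # empty place or leaf
    zeros = [w for w in words if w[depth] == "0"]
    ones = [w for w in words if w[depth] == "1"]
    size0, h0 = trie_stats(zeros, depth + 1)
    size1, h1 = trie_stats(ones, depth + 1)
    return 1 + size0 + size1, 1 + max(h0, h1)


size, height = trie_stats(random_words(500, seed=2010))
print(f"size = {size} internal nodes, height = {height}")
```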
Algorithms: 2 - Hashing
Data may be highly structured and share long prefixes. Use a transformation h: W -> W’ called “hashing” (akin to random number generators). Uniform binary data are meaningful!
Analysis?
Algorithms: 3 - Paging
Data may be accessible by blocks, e.g., pages on disc. Stop recursion as soon as “b” elements are isolated (standard: b = 1). Combine with hashing = get index structure.
[Figure: index on top, pages below]
Analysis?
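Hashing followed by a paged trie can be sketched as below. This is only an illustration under stated assumptions: SHA-256 truncated to 64 bits stands in for the hashing transformation h, and the names `hashed_bits` and `bucket_trie` are hypothetical. The function counts internal nodes of the index and the number of non-empty pages for a page capacity b.

```python
import hashlib


def hashed_bits(key, nbits=64):
    """Map a key to a pseudo-uniform binary word (assumption: SHA-256 prefix)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    value = int.from_bytes(digest[: nbits // 8], "big")
    return format(value, f"0{nbits}b")


def bucket_trie(words, b=1, depth=0):
    """Return (index_nodes, pages) for a binary trie with page capacity b.

    The recursion stops as soon as at most b words are isolated;
    these words are stored together in one page.
    """
    if not words:
        return 0, 0                      # empty place: no page allocated
    if len(words) <= b:
        return 0, 1                      # one page holds the remaining words
    zeros = [w for w in words if w[depth] == "0"]
    ones = [w for w in words if w[depth] == "1"]
    i0, p0 = bucket_trie(zeros, b, depth + 1)
    i1, p1 = bucket_trie(ones, b, depth + 1)
    return 1 + i0 + i1, p0 + p1


# Structured keys sharing a long prefix become uniform-looking after hashing.
keys = [f"customer-{i:05d}" for i in range(200)]
words = [hashed_bits(k) for k in keys]
print(bucket_trie(words, b=4))
```

Larger values of b trade a smaller index for fuller pages; b = 1 gives back the standard trie.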
Algorithms: 4 - MultiDim
Data may be multidimensional & numeric/geometric.
[Figure: quad-trie]
Analysis?
Algorithms: 5 - Communication
Data may be distributed and accessible only via a common channel (network). Everybody speaks at the same time; if noise, then SPLIT according to individual coin flips.
[Figure: tree protocol on stations A, B, C: ABC -> (B, AC), AC -> (-, AC), AC -> (A, C); the leader is marked]
Analysis?
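The splitting rule can be simulated in a few lines. The sketch below assumes the basic form of the tree protocol, in which the stations that flip "heads" are served first and the others wait; the function name `resolve` and the slot labels are illustrative choices, not taken from the talk.

```python
import random


def resolve(stations, rng, log):
    """Resolve the conflict among `stations`, appending one entry per channel slot.

    Slot outcomes: "idle" (nobody speaks), "success:<name>" (exactly one
    station speaks), "collision" (noise: split by individual coin flips).
    """
    if len(stations) == 0:
        log.append("idle")
        return
    if len(stations) == 1:
        log.append(f"success:{stations[0]}")
        return
    log.append("collision")
    heads, tails = [], []
    for s in stations:
        (heads if rng.random() < 0.5 else tails).append(s)
    resolve(heads, rng, log)             # the "heads" group speaks first
    resolve(tails, rng, log)             # then the "tails" group


log = []
resolve(["A", "B", "C"], random.Random(2010), log)
print(log)
```

Each slot corresponds to one node of a random binary trie built on the stations' coin-flip sequences: collisions are internal nodes, successes and idle slots are leaves.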
2. Expectations
Bernoulli vs Poisson models
Mellin technology
Fluctuations and error terms
[Slide on S_n as a function of n; content not recovered.]
(Proof in a “modernized” version follows...)
Algebra... [Diagram with branch probabilities p, q]
Algebra...
With $S_n$ the expected tree size when the tree contains $n$ elements and $S(x)$ the Poisson expectation:
\[ S(x) = \sum_{n \ge 0} S_n \, e^{-x} \frac{x^n}{n!}. \]
The Poisson expectation $S(x)$ is like a generating function of $\{S_n\}$. Go back (“depoissonize”) by Taylor expansion. E.g.:
\[ p = q = \tfrac{1}{2} \;\Longrightarrow\; S_n = \sum_k 2^k \Bigl[ 1 - \Bigl(1 - \frac{1}{2^k}\Bigr)^{n} - \frac{n}{2^k} \Bigl(1 - \frac{1}{2^k}\Bigr)^{n-1} \Bigr]. \]
Many variants are possible and one can justify (by elementary means) that $S_n = S(x) + \text{small}$ when $x = n$.
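The claim that the fixed-n and Poisson expectations nearly coincide at x = n can be checked numerically in the symmetric case p = q = 1/2. The sketch below makes several assumptions: the level sum is taken over k = 0..60, the fixed-n expectation is the level-by-level sum $\sum_k 2^k[1-(1-2^{-k})^n - n2^{-k}(1-2^{-k})^{n-1}]$, the Poisson expectation is $\sum_k 2^k g(x/2^k)$ with $g(y) = 1-(1+y)e^{-y}$, and the function names are hypothetical. Exact rational arithmetic is used for $S_n$ to avoid cancellation at deep levels.

```python
from fractions import Fraction
from math import exp, expm1, log


def expected_size_fixed(n, kmax=60):
    """S_n for p = q = 1/2 and n >= 1, summed exactly over levels 0..kmax."""
    total = Fraction(0)
    for k in range(kmax + 1):
        eps = Fraction(1, 2**k)          # probability of a given node at level k
        # A node exists iff at least 2 of the n words pass through it.
        total += 2**k * (1 - (1 - eps) ** n - n * eps * (1 - eps) ** (n - 1))
    return float(total)


def expected_size_poisson(x, kmax=60):
    """S(x) = sum_k 2^k g(x / 2^k), with g(y) = 1 - (1 + y) e^{-y}."""
    total = 0.0
    for k in range(kmax + 1):
        y = x / 2**k
        total += 2**k * (-expm1(-y) - y * exp(-y))   # stable form of g(y)
    return total


n = 500
s_fixed = expected_size_fixed(n)
s_poisson = expected_size_poisson(float(n))
print(s_fixed, s_poisson, s_fixed - s_poisson, s_fixed / n, 1 / log(2))
```

For n = 2 the exact sum reduces to $\sum_k 2^{-k} = 2$, which gives a quick sanity check of the level formula.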
Analysis... The Mellin transform
\[ f(x) \;\xrightarrow{\;\mathcal{M}\;}\; f^\star(s) := \int_0^\infty f(x)\, x^{s-1}\, dx \]
(It exists in strips of $\mathbb{C}$ determined by growth of $f(x)$ at $0$, $+\infty$.)
Property 1. Factors harmonic sums:
\[ \sum_{(\lambda,\mu)} \lambda f(\mu x) \;\xrightarrow{\;\mathcal{M}\;}\; \Bigl( \sum_{(\lambda,\mu)} \lambda \mu^{-s} \Bigr) \cdot f^\star(s). \]
Property 2. Maps asymptotics of $f$ on singularities of $f^\star$:
\[ f^\star \approx \frac{1}{(s - s_0)^m} \;\Longrightarrow\; f(x) \approx x^{-s_0} (\log x)^{m-1}. \]
Proof of P2 is from Mellin inversion + residues:
\[ f(x) = \frac{1}{2i\pi} \int_{c - i\infty}^{c + i\infty} f^\star(s)\, x^{-s}\, ds. \]
Mellin and Tries
$p = q = 1/2$: $S(x) = \sum_k 2^k g(x/2^k)$, with $g(x) = 1 - (1+x)e^{-x}$.
Harmonic sum property:
\[ S^\star(s) = \Bigl( \sum_k 2^k\, 2^{ks} \Bigr) \cdot \bigl( -(s+1)\Gamma(s) \bigr) = -\frac{(s+1)\Gamma(s)}{1 - 2^{1+s}}. \]
Mapping properties: $S^\star$ exists in $-2 < \Re(s) < -1$. Poles at $s_k = -1 + 2ik\pi/\log 2$, for $k \in \mathbb{Z}$.
Location of pole: $s_0 = \sigma + i\tau$ --> asymptotics of $f(x) \approx x^{-s_0} = x^{-\sigma} e^{-i\tau \log x}$.
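As a worked illustration of how the poles translate into asymptotics, the leading term of $S(x)$ can be derived from the pole at $s_0 = -1$. This is a sketch under the standard convention that, for $x \to \infty$, the inversion contour is closed to the right of the strip, so that each pole contributes minus the residue of $S^\star(s)x^{-s}$; the computation of $g^\star$ below is done directly from $g(x) = 1-(1+x)e^{-x}$ and is not copied from the slides.

```latex
% Sketch: leading asymptotics of S(x) for p = q = 1/2 (assumptions stated in the lead-in).
\begin{align*}
g^\star(s) &= \int_0^\infty \bigl(1 - e^{-x} - x e^{-x}\bigr)\, x^{s-1}\, dx
            = -\Gamma(s) - \Gamma(s+1) = -(s+1)\Gamma(s),
            && -2 < \Re(s) < 0, \\
S^\star(s) &= \frac{-(s+1)\Gamma(s)}{1 - 2^{1+s}},
            && -2 < \Re(s) < -1, \\
1 - 2^{1+s} &\sim -(\log 2)\,(s - s_k) \quad (s \to s_k),
            \qquad -(s+1)\Gamma(s) \to 1 \quad (s \to -1), \\
S^\star(s) &\sim \frac{-1}{(\log 2)\,(s+1)} \quad (s \to -1)
            \;\Longrightarrow\; S(x) \sim \frac{x}{\log 2}, \\
x^{-s_k} &= x \cdot e^{-2ik\pi \log_2 x}
            && (k \neq 0).
\end{align*}
```

The last line shows why the poles with $k \neq 0$ produce terms of order $x$ multiplied by periodic functions of $\log_2 x$: these are the fluctuations.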
Memoryless sources (I)
Correspond to $p \neq q$. Dirichlet series is $\dfrac{1}{1 - p^{-s} - q^{-s}}$.
Theorem (Knuth 1973; Fayolle, F., Hofri 1986, ...). Let $H := p \log p^{-1} + q \log q^{-1}$ be the entropy.
• In the periodic case, $\frac{\log p}{\log q} \in \mathbb{Q}$, there are fluctuations in $S_n$.
• In the aperiodic case, $\frac{\log p}{\log q} \notin \mathbb{Q}$: $S_n \sim \frac{n}{H}$ and $D_n \sim \frac{1}{H} \log n$.
Philippe Robert & Hanene Mohamed relate this to the periodic/aperiodic dichotomy of renewal theory (2005+).
(π, e, tan(1), log 2, ζ(3), ...) [Lapidus & van Frankenhuijsen 2006]
3. Distributions
Analytic depoissonization & Saddle-points
Gaussian laws
...
Throw n balls into 2^h buckets, each of capacity b.
E[2^H] -->
[2001]
DISTRIBUTIONS: size, depth, and path-length
Case of size, p = q = 1/2:
Start with bivariate generating function F(z,u).
Analyse log ...
Analyse perturbation near u = 1.
Use analytic depoissonization.
Conclude by continuity theorem for characteristic functions.
Profile of tries, after Szpankowski et al. + Cesaratto-Vallée 2010+
4. General sources
Comparing and sorting real numbers
Continued fractions
Fundamental intervals...
Comparing numbers & sorting by continued fractions
\[ \operatorname{sign}\Bigl( \frac{a}{b} - \frac{c}{d} \Bigr) = \operatorname{sign}(ad - bc). \]
Requires double precision and/or is unstable with floats. (Computational geometry, Knuth’s Metafont, ...)
==> Hakmem Algorithm (Gosper, 1972):
\[ \frac{36}{113} = \cfrac{1}{3 + \cfrac{1}{7 + \cfrac{1}{5}}}, \qquad \frac{113}{355} = \cfrac{1}{3 + \cfrac{1}{7 + \cfrac{1}{16}}}. \]
Theorem (Clément, F., Vallée 2000+). Sorting with continued fractions: mean path length of trie is
\[ K_0\, n \log n + K_1\, n + Q(n) + K_2 + o(1), \]
\[ K_0 = \frac{6 \log 2}{\pi^2}, \qquad K_1 = \frac{18\, \gamma \log 2}{\pi^2} + \frac{9 (\log 2)^2}{\pi^2} - \frac{72 \log 2\; \zeta'(2)}{\pi^4} - \frac{1}{2}, \]
and $Q(n) \approx n^{1/4}$ is equivalent to the Riemann Hypothesis.
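Comparison through continued fraction digits can be sketched with integer arithmetic only, so that no product such as ad or bc is ever formed. This is an illustrative reading of the idea, not Gosper's original formulation: the names `cf_digits` and `cf_compare` are hypothetical, denominators are assumed positive, and an exhausted expansion is treated as if its next partial quotient were infinite.

```python
from math import inf


def cf_digits(a, b):
    """Yield the partial quotients of a/b (b > 0) by Euclid's algorithm."""
    while b != 0:
        q = a // b                       # floor division, also for negative a
        yield q
        a, b = b, a - q * b


def cf_compare(a, b, c, d):
    """Return -1, 0 or 1 according to a/b <, =, > c/d (with b, d > 0).

    The two expansions are read in parallel; the first differing partial
    quotient decides, with the direction alternating with the position.
    """
    xs, ys = cf_digits(a, b), cf_digits(c, d)
    sign = 1                             # +1 at even positions, -1 at odd ones
    while True:
        q, r = next(xs, inf), next(ys, inf)
        if q == inf and r == inf:
            return 0                     # both expansions ended: equal numbers
        if q != r:
            return sign if q > r else -sign
        sign = -sign


print(list(cf_digits(36, 113)), list(cf_digits(113, 355)))
print(cf_compare(36, 113, 113, 355))
```

On the pair from the slide, the expansions agree on 0, 3, 7 and differ at the fourth partial quotient (5 against 16), which is enough to decide the comparison.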
[Vallée 1997++]
View source model in terms of fundamental intervals: w -> p_w.
Revisit the analysis of tries (e.g., size).
Mellinize:
Vallée 1997-2001, Baladi-Vallée 2005+, ... For expanding maps T, fundamental intervals are generated by a transfer operator. For binary system (+Markov) and continued fractions, simplifications occur.
...and Nörlund integrals complete the job!
Poisson + Mellin = Newton -> Nörlund - = fixed-n model. Q.E.D.
cf. [F., Sedgewick 1995]
5. Other trie algorithms
Leader election
The tree communication protocol
“Patricia” trees
Data compression: Lempel-Ziv...
Probabilistic counting
Quicksort is O(n (log n)^2)...
Leader election = leftmost boundary of a random trie (1/2, 1/2).
[Figure: splitting tree ABC -> (B, AC), AC -> (-, AC), AC -> (A, C); the leader is marked]
Proof: tree decompositions + Mellin...
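A coin-flipping election along one boundary branch of the trie can be sketched as follows. The rule used here is one common formulation and is an assumption, not a transcription of the slide: in each round every remaining candidate flips a fair coin, those who flip heads stay, and if nobody flips heads the round is simply repeated with the same candidates. The name `elect_leader` is hypothetical.

```python
import random


def elect_leader(candidates, rng):
    """Return (leader, rounds) for the coin-flipping election (assumed rule)."""
    alive = list(candidates)
    rounds = 0
    while len(alive) > 1:
        heads = [c for c in alive if rng.random() < 0.5]
        if heads:                        # an empty "heads" group changes nothing
            alive = heads
        rounds += 1
    return alive[0], rounds


leader, rounds = elect_leader(["A", "B", "C"], random.Random(2010))
print(leader, rounds)
```

The number of rounds is the length of the boundary branch followed in the trie, which is the quantity the tree decompositions and the Mellin analysis address.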
Tree protocol = trie with arrivals (non-commutative iteration semigroup).
[Figure: splitting tree ABC -> (B, AC), AC -> (-, AC), AC -> (A, C)]
A curiosity (cf. Mellin):
= !! -0.249999999999999999999999999999999999999999999999999 999999999999999999999999999999999999999999999999999999 999999999999999999999999999999999999999999999999999999 9999999999999999999999999999999999999999999999999998211
(= -1/2 + 10^{-211}: there are 208 consecutive nines)