Random Sampling
Florian Schoppmann
August 24, 2010

Outline:
• Non-Sequential
• Sequential
• Sequential with Reservoir
• Sequential with Reservoir and Replacement

Sampling Algorithms

Input:
• List[1 . . . N]
• Length of database N (if known)
• Length of sample n

Output:
• Sample[1 . . . n]

Considerations

• Online vs. random-access
• Sequential vs. non-sequential
• Samples for independent categories

Desiderata:
• Parallelizable
• If random access, running time close to O(n)
• Constant memory

Random Indices

1: m ← 0
2: while m < n do
3:     R ← random({1 . . . N})
4:     if List[R] ∉ Sample then
5:         m ← m + 1
6:         Sample[m] ← List[R]

• Approx. N ln(N / (N − n + 1)) iterations in expectation
• Space/time trade-off in line 4

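The retry loop above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `sample_random_indices` and the `rng` parameter are hypothetical, and the sketch remembers the chosen indices in a set rather than testing membership of the values in Sample, which is one way to resolve the space/time trade-off in favor of time.

```python
import random

def sample_random_indices(items, n, rng=random):
    """Draw n items at distinct positions by retrying until an unseen index appears."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    seen = set()    # assumption: O(n) extra space buys an O(1) membership test
    sample = []
    while len(sample) < n:
        r = rng.randrange(N)    # uniform index in 0..N-1
        if r not in seen:       # retry whenever the index was already drawn
            seen.add(r)
            sample.append(items[r])
    return sample
```

A linear scan of the sample instead of the set would save the extra memory at the cost of O(n) work per membership test.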
Random Remaining Indices

for m ← 1, . . . , n do
    R ← random({1 . . . N − m + 1})
    j ← index of R-th non-null element in List
    Sample[m] ← List[j]
    List[j] ← null

• Prohibitive running time Θ(nN)
• Modifies List

The Fisher-Yates Shuffle

for m ← 1, . . . , n do
    R ← random({m . . . N})
    Swap List[m] and List[R]
Sample[1 . . . n] ← List[1 . . . n]

• Running time Θ(n)
• Modifies List

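A partial Fisher-Yates shuffle that stops after n swaps might look like the sketch below. The name `sample_partial_shuffle` and the `rng` parameter are illustrative assumptions, and indices are zero-based, unlike the pseudocode.

```python
import random

def sample_partial_shuffle(items, n, rng=random):
    """Shuffle only the first n positions of items in place and return them."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    for m in range(n):
        r = rng.randrange(m, N)                  # uniform index in m..N-1
        items[m], items[r] = items[r], items[m]  # position m is now final
    return items[:n]
```

The input list is reordered as a side effect, so a caller who needs the original order would have to pass a copy, which costs Θ(N) again.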
Probabilistic Sampling

for t ← 1, . . . , N do
    with probability n/N do
        Append List[t] to Sample

• Running time Θ(N)
• Sample size is n only in expectation (mean of B(N, n/N))
• Standard deviation √(n(1 − n/N))

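As a rough sketch, the independent coin flip per row can be written as below; `bernoulli_sample` and `rng` are hypothetical names. For example, with N = 1000 and n = 100 the sample size has mean 100 and standard deviation √(100 · 0.9) ≈ 9.49, so the result will usually not contain exactly 100 rows.

```python
import random

def bernoulli_sample(items, n, rng=random):
    """Keep each item independently with probability n/N (N = len(items))."""
    N = len(items)
    if N == 0:
        return []
    p = n / N
    # rng.random() is uniform on [0, 1), so the test succeeds with probability p
    return [x for x in items if rng.random() < p]
```

Because every row is decided independently, disjoint chunks of the input could be processed in parallel and the partial samples concatenated.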
Selection Sampling

m ← 0
for t ← 1, . . . , N do
    with probability (n − m) / (N − t + 1) do
        m ← m + 1
        Sample[m] ← List[t]

• Running time Θ(N)
• Completely unbiased!

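One way to express selection sampling in Python is sketched below. The names `selection_sample` and `rng` are assumptions for illustration. Here t counts the rows already scanned (zero-based), so N − t rows remain including the current one, and each row is taken with probability (rows still needed) / (rows remaining).

```python
import random

def selection_sample(items, n, rng=random):
    """Scan items once and return exactly n of them, in their original order."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    sample = []
    for t, x in enumerate(items):    # t = number of rows already scanned
        remaining = N - t            # rows left, including the current one
        needed = n - len(sample)     # rows still to be selected
        # randrange(remaining) < needed holds with probability needed/remaining
        if rng.randrange(remaining) < needed:
            sample.append(x)
    return sample
```

When the number of rows still needed equals the number remaining, the test always succeeds, which is why the sample size is exactly n rather than n in expectation.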
Random Number Generation (Digression)

Running time O(n) possible by skipping rows?

Idea 1:
• Let S ∈ {0 . . . N − n} be the random variable for the number of rows to skip:
  $\Pr[S \le s] = 1 - \frac{(N-n)^{\underline{s+1}}}{N^{\underline{s+1}}}$
  where $x^{\underline{k}} = x(x-1)\cdots(x-k+1)$ denotes the falling factorial.

Idea 2 (Vitter, 1984):
• von Neumann's rejection & "squeeze" method

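Idea 1 can be illustrated by inverting the distribution of S with a sequential search: draw one uniform number V and advance s until Pr[S > s] ≤ V. The sketch below is only one possible reading of the idea. The names `draw_skip` and `skip_sample` are hypothetical, and the search itself still takes time proportional to the number of skipped rows, so this variant saves random draws but does not by itself reach O(n).

```python
import random

def draw_skip(n, N, rng=random):
    """Rows to skip before the next selected row, given n of N rows are still to be chosen."""
    v = rng.random()
    s = 0
    q = (N - n) / N                    # Pr[S > 0]
    while q > v:
        s += 1
        q *= (N - n - s) / (N - s)     # q is now Pr[S > s]; it reaches 0 at s = N - n
    return s

def skip_sample(items, n, rng=random):
    """Select exactly n rows using one random number per selected row."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    sample, pos = [], 0
    while n > 0:
        pos += draw_skip(n, N - pos, rng)  # jump over the skipped rows
        sample.append(items[pos])
        pos += 1
        n -= 1
    return sample
```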
Vitter (1984)

$g(x) = \frac{n}{N}\left(1 - \frac{x}{N}\right)^{n-1}$, $c = \frac{N}{N-n+1}$, $h(x) = \frac{n}{N}\left(1 - \frac{x}{N-n+1}\right)^{n-1}$

[Figure: c · g(x), h(x), and Pr[S = x] plotted for N = 20, n = 5; x-axis from 0 to 15, y-axis from 0 to 0.2]

Reservoir Sampling

Sample[1 . . . n] ← List[1 . . . n]
for t ← n + 1, . . . , N do
    with probability n/t do
        R ← random({1 . . . n})
        Sample[R] ← List[t]

• Completely unbiased!
• Expected running time O(n(1 + log(N/n))) achievable by skipping rows (Vitter, 1985)

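A minimal Python sketch of the basic (unoptimized) reservoir scheme follows; `reservoir_sample` and `rng` are illustrative names. As a small deviation from the pseudocode, the sketch folds the probability-n/t test and the choice of the slot into a single random draw: an index drawn uniformly from 0..t−1 falls below n with probability n/t, and in that case it is uniform over the n slots.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Keep a uniform sample of n items from a stream of unknown length."""
    reservoir = []
    for t, x in enumerate(stream, start=1):   # t = rows seen so far, one-based
        if t <= n:
            reservoir.append(x)               # the first n rows fill the reservoir
        else:
            r = rng.randrange(t)              # uniform in 0..t-1
            if r < n:                         # happens with probability n/t
                reservoir[r] = x              # evict a uniformly chosen slot
    return reservoir
```

Since N is never used, the same function works on iterators and generators; a stream shorter than n simply returns all of its rows.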
Reservoir, with Replacement

for t ← 1, . . . , N do
    for i ← 1, . . . , n do
        with probability 1/t do
            Sample[i] ← List[t]

• Completely unbiased!

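The with-replacement variant amounts to running n independent size-one reservoirs side by side. The sketch below assumes the hypothetical names `reservoir_sample_with_replacement` and `rng`, and it takes the straightforward Θ(nN) route of one random draw per row and slot.

```python
import random

def reservoir_sample_with_replacement(stream, n, rng=random):
    """Each of the n slots independently ends up holding a uniformly chosen row."""
    sample = [None] * n                 # stays None only if the stream is empty
    for t, x in enumerate(stream, start=1):
        for i in range(n):
            if rng.randrange(t) == 0:   # probability 1/t; always true for t = 1
                sample[i] = x
    return sample
```

The same row may appear in several slots, which is exactly what sampling with replacement allows.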
Bibliography

• Knuth (1997): The Art of Computer Programming, Vol. 2
• Vitter (1984): Faster Methods for Random Sampling
• Vitter (1985): Random Sampling with a Reservoir
• Park, Ostrouchov, Samatova, Geist (2004): Reservoir-Based Random Sampling with Replacement from Data Stream