External Memory Geometric Data Structures
Lars Arge
Duke University
June 28, 2002
Summer School on Massive Datasets

External memory data structures

Yesterday
• Fan-out $\Theta(B^{1/c})$ B-tree ($c \ge 1$)
  – Degree balanced tree with each node/leaf in $O(1)$ blocks
  – $O(N/B)$ space
  – $O(\log_B N + T/B)$ I/O query
  – $O(\log_B N)$ I/O update
• Persistent B-tree
  – Update current version, query all previous versions
  – B-tree bounds with N number of operations performed
• Buffer tree technique
  – Lazy update/queries using buffers attached to each node
  – $O(\frac{1}{B} \log_{M/B} \frac{N}{B})$ amortized bounds
  – E.g. used to construct structures in $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$ I/Os

Simplifying Assumption
• Model
  – N : Elements in structure
  – B : Elements per block
  – M : Elements in main memory
  – T : Output size in searching problems
  [Figure: disk D, block I/O, main memory M, processor P]
• Assumption
  – Today (and tomorrow) assume that $M > B^2$
  – Assumption not crucial but simplifies expressions a lot, e.g.:
    $O(\frac{N}{B} \log_{M/B} \frac{N}{B}) = O(\frac{N}{B} \log_B N)$

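A short derivation of why the assumption gives the simplified bound may help here. It is a sketch that additionally assumes $N \ge B$, so that the logarithms are non-negative, and it reads the "=" between O-expressions as "is bounded by".

```latex
\[
M > B^2 \;\Rightarrow\; \frac{M}{B} > B \;\Rightarrow\;
\log_{M/B}\frac{N}{B}
  = \frac{\log (N/B)}{\log (M/B)}
  \le \frac{\log (N/B)}{\log B}
  = \log_B \frac{N}{B}
  \le \log_B N ,
\]
\[
\text{so}\quad
O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)
  = O\!\left(\frac{N}{B}\log_B N\right).
\]
```
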
Today
• “Dimension 1.5” problems:
  – More complicated problems: Interval stabbing and point location
  – Looking for same bounds:
    * $O(N/B)$ space
    * $O(\log_B N + T/B)$ query
    * $O(\log_B N)$ update
    * $O(\frac{N}{B} \log_{M/B} \frac{N}{B}) = O(\frac{N}{B} \log_B N)$ construction
• Use of tools/techniques discussed yesterday as well as
  – Logarithmic method
  – Weight-balanced B-trees
  – Global rebuilding

Interval Management
• Problem:
  – Maintain N intervals with unique endpoints dynamically such that a stabbing query with point x (report all intervals containing x) can be answered efficiently
• As in the (one-dimensional) B-tree case we are interested in
  – $O(N/B)$ space
  – $O(\log_B N)$ update
  – $O(\log_B N + T/B)$ query

Interval Management: Static Solution
• Sweep from left to right maintaining persistent B-tree
  – Insert interval when left endpoint is reached
  – Delete interval when right endpoint is reached
• Query x answered by reporting all intervals in B-tree at “time” x
  – $O(N/B)$ space
  – $O(\log_B N + T/B)$ query
  – $O(\frac{N}{B} \log_B N)$ construction using buffer technique
• Dynamic with $O(\log_B^2 N)$ insert bound using logarithmic method

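The sweep can be illustrated with a small in-memory sketch. It is not a persistent B-tree: it simply stores a full snapshot of the active set after every event, which can take quadratic space, so it only shows the "query at time x" idea. The names `build_versions` and `stab` are hypothetical, and intervals are treated as half-open `[lo, hi)` so that an interval is no longer reported at its right endpoint.

```python
from bisect import bisect_right

def build_versions(intervals):
    """Sweep endpoints left to right; record the active set after each event."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 0, (lo, hi)))  # left endpoint: insert interval
        events.append((hi, 1, (lo, hi)))  # right endpoint: delete interval
    events.sort()                         # unique endpoints, so no ties in time
    times, versions, active = [], [], set()
    for t, kind, iv in events:
        if kind == 0:
            active.add(iv)
        else:
            active.discard(iv)
        times.append(t)
        versions.append(frozenset(active))  # snapshot stands in for a version
    return times, versions

def stab(times, versions, x):
    """Report the intervals alive in the version at 'time' x."""
    i = bisect_right(times, x) - 1        # last event at or before x
    return set() if i < 0 else set(versions[i])

times, versions = build_versions([(1, 5), (2, 8), (6, 9)])
print(stab(times, versions, 3))
```

A persistent B-tree replaces the snapshots with shared structure, which is what brings the space down to the linear bound stated on the slide.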
Internal Memory Logarithmic Method Idea
• Given (semi-dynamic) structure D on set V
  – $O(\log N)$ query, $O(\log N)$ delete, $O(N \log N)$ construction
• Logarithmic method:
  – Partition V into subsets $V_0, V_1, \ldots, V_{\log N}$ with $|V_i| = 2^i$ or $|V_i| = 0$
  – Build $D_i$ on $V_i$
    * Delete: $O(\log N)$
    * Query: Query each $D_i$ ⇒ $O(\log^2 N)$
    * Insert: Find first empty $D_i$ and construct $D_i$ out of $1 + \sum_{j=0}^{i-1} 2^j = 2^i$ elements in $V_0, V_1, \ldots, V_{i-1}$
      – $O(2^i \log 2^i)$ construction ⇒ $O(\log N)$ per moved element
      – Element moved $O(\log N)$ times ⇒ $O(\log^2 N)$ amortized
  [Figure: structures of sizes $2^0, 2^1, 2^2, \ldots, 2^{\log N}$]

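The insert rule behaves like incrementing a binary counter. A minimal sketch, assuming a sorted Python list stands in for the static structure D (so "construction" is a sort and "query" is a membership test); the class name `LogMethod` is hypothetical.

```python
from bisect import bisect_left

class LogMethod:
    def __init__(self):
        # levels[i] is either empty or a sorted list of exactly 2**i keys
        self.levels = []

    def insert(self, key):
        carry = [key]
        i = 0
        # Collect V_0 .. V_{i-1} until the first empty level i is found.
        while i < len(self.levels) and self.levels[i]:
            carry.extend(self.levels[i])
            self.levels[i] = []
            i += 1
        if i == len(self.levels):
            self.levels.append([])
        # "Construct D_i" from the 1 + sum_{j<i} 2**j = 2**i collected keys.
        self.levels[i] = sorted(carry)

    def contains(self, key):
        # Query every D_i in turn.
        for level in self.levels:
            j = bisect_left(level, key)
            if j < len(level) and level[j] == key:
                return True
        return False

s = LogMethod()
for k in range(11):
    s.insert(k)
print([len(level) for level in s.levels])
```

After 11 insertions the non-empty levels correspond to the 1-bits of 11 in binary.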
External Logarithmic Method Idea
• Decrease number of subsets $V_i$ to $\log_B N$ to get $O(\log_B^2 N)$ query
  [Figure: structures of sizes $B^0, B^1, B^2, \ldots, B^{\log_B N}$]
• Problem: Since $1 + \sum_{j=0}^{i-1} B^j < B^i$ there are not enough elements in $V_0, V_1, \ldots, V_{i-1}$ to build $V_i$
• Solution: We allow $V_i$ to contain any number of elements $\le B^i$
  – Insert: Find first $D_i$ such that $\sum_{j=0}^{i} |V_j| < B^i$ and construct new $D_i$ from elements in $V_0, V_1, \ldots, V_i$
    * We move $\sum_{j=0}^{i-1} |V_j| \ge B^{i-1}$ elements
    * If $D_i$ is constructed in $O((|V_i|/B) \log_B |V_i|) = O(B^{i-1} \log_B N)$ I/Os, every moved element is charged $O(\log_B N)$ I/Os
    * Element moved $O(\log_B N)$ times ⇒ $O(\log_B^2 N)$ amortized

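The relaxed insert rule can be sketched in memory as follows. As an assumption for illustration, a sorted list again stands in for the external structure $D_i$ and no I/Os are modelled; the class name `ExternalLogMethod` is hypothetical. The sketch only demonstrates the level-selection rule and the invariant $|V_i| \le B^i$.

```python
from bisect import bisect_left

class ExternalLogMethod:
    def __init__(self, B):
        if B < 2:
            raise ValueError("B must be at least 2")
        self.B = B
        self.levels = []  # levels[i] holds at most B**i keys, kept sorted

    def insert(self, key):
        total = 0
        i = 0
        # Find the first i with |V_0| + ... + |V_i| < B**i.
        while True:
            if i == len(self.levels):
                self.levels.append([])
            total += len(self.levels[i])
            if total < self.B ** i:
                break
            i += 1
        # Rebuild D_i from the new key and everything in V_0 .. V_i.
        merged = [key]
        for j in range(i + 1):
            merged.extend(self.levels[j])
            self.levels[j] = []
        self.levels[i] = sorted(merged)  # size total + 1 <= B**i

    def contains(self, key):
        for level in self.levels:
            j = bisect_left(level, key)
            if j < len(level) and level[j] == key:
                return True
        return False

s = ExternalLogMethod(B=4)
for k in range(20):
    s.insert(k)
print([len(level) for level in s.levels])
```

Because level i was the first to pass the test, levels 0 to i-1 together hold at least $B^{i-1}$ keys, which is the quantity the amortized charging argument relies on.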
External Logarithmic Method Idea
• Given (semi-dynamic) linear space external data structure with
  – $O(\log_B N + T/B)$ I/O query
  – $O(\frac{N}{B} \log_B N)$ I/O construction
  – ($O(\log_B N)$ I/O delete)
⇒ Linear space dynamic data structure with
  – $O(\log_B^2 N + T/B)$ I/O query
  – $O(\log_B^2 N)$ I/O insert amortized
  – ($O(\log_B N)$ I/O delete)
• Dynamic interval management
  – $O(\log_B^2 N + T/B)$ I/O query
  – $O(\log_B^2 N)$ I/O insert amortized

Internal Interval Tree
• Base tree on endpoints
  – “slab” $X_v$ associated with each node v
• Interval stored in highest node v where it contains midpoint of $X_v$
• Intervals $I_v$ associated with v stored in
  – Left slab list sorted by left endpoint (search tree)
  – Right slab list sorted by right endpoint (search tree)
⇒ Linear space and $O(\log N)$ update (assuming fixed endpoint set)

Internal Interval Tree
• Query with x on left side of midpoint of $X_{root}$
  – Search left slab list left-right until finding non-stabbed interval
  – Recurse in left child
⇒ $O(\log N + T)$ query bound

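A static version of the internal interval tree and its query can be sketched as below. Several simplifications are assumptions of this sketch rather than part of the slides: the "midpoint" of a node is taken to be the median endpoint of the intervals reaching it, the slab lists are plain sorted Python lists instead of search trees, intervals are closed, and the hypothetical class `Node` rebuilds by re-sorting at every level.

```python
class Node:
    def __init__(self, intervals):
        endpoints = sorted(p for iv in intervals for p in iv)
        self.mid = endpoints[len(endpoints) // 2]
        # Intervals containing the midpoint are stored at this node.
        here = [iv for iv in intervals if iv[0] <= self.mid <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.mid]
        right = [iv for iv in intervals if iv[0] > self.mid]
        self.by_left = sorted(here)  # left slab list: increasing left endpoint
        # Right slab list: decreasing right endpoint.
        self.by_right = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = Node(left) if left else None
        self.right = Node(right) if right else None

    def stab(self, x, out=None):
        out = [] if out is None else out
        if x <= self.mid:
            # Every interval here ends at or after mid >= x,
            # so it is stabbed exactly when its left endpoint is <= x.
            for iv in self.by_left:
                if iv[0] > x:
                    break  # first non-stabbed interval: stop scanning
                out.append(iv)
            if x < self.mid and self.left:
                self.left.stab(x, out)
        else:
            for iv in self.by_right:
                if iv[1] < x:
                    break
                out.append(iv)
            if self.right:
                self.right.stab(x, out)
        return out

tree = Node([(1, 5), (2, 8), (6, 9), (10, 12)])
print(sorted(tree.stab(3)))
```

Each visited node costs time proportional to the intervals it reports plus one, which matches the shape of the query bound on the slide.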
Externalizing Interval Tree
• Natural idea:
  – Block tree
  – Use B-tree for slab lists
• Number of stabbed intervals in large slab list may be small (or zero)
  – We can be forced to do I/O in each of $O(\log N)$ nodes

Externalizing Interval Tree
• Idea:
  – Decrease fan-out to $\Theta(\sqrt{B})$ ⇒ height remains $O(\log_B N)$ (since $\log_{\sqrt{B}} N = 2 \log_B N$)
  – $\Theta(\sqrt{B})$ slabs define $\Theta(B)$ multislabs
  – Interval stored in two slab lists (as before) and one multislab list
  – Intervals in small multislab lists collected in underflow structure
  – Query answered in v by looking at 2 slab lists and not $O(\log N)$

External Interval Tree
• Base tree: Fan-out $\Theta(\sqrt{B})$ B-tree on endpoints
  – Interval stored in highest node v where it contains slab boundary
• Each internal node v contains:
  – Left slab list for each of $\Theta(\sqrt{B})$ slabs
  – Right slab list for each of $\Theta(\sqrt{B})$ slabs
  – $\Theta(B)$ multislab lists
  – Underflow structure
• Interval in set $I_v$ of intervals associated with v stored in
  – Left slab list of slab containing left endpoint
  – Right slab list of slab containing right endpoint
  – Widest multislab list it spans
• If < B intervals in multislab list they are instead stored in underflow structure (⇒ contains $\le B^2$ intervals)

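How one interval is assigned to lists inside a node can be sketched as follows. The function names and the convention that slab j is the half-open range between boundary j-1 and boundary j are assumptions made for illustration; `boundaries` is the sorted list of slab boundaries of a single node.

```python
from bisect import bisect_right

def place_interval(boundaries, lo, hi):
    """Return (left slab, right slab, multislab) for interval [lo, hi].

    Returns None if the interval contains no slab boundary of this node,
    in which case it belongs further down the tree.
    """
    left_slab = bisect_right(boundaries, lo)    # slab holding the left endpoint
    right_slab = bisect_right(boundaries, hi)   # slab holding the right endpoint
    if left_slab == right_slab:
        return None
    # Widest run of slabs completely spanned by the interval, if any.
    first, last = left_slab + 1, right_slab - 1
    multislab = (first, last) if first <= last else None
    return left_slab, right_slab, multislab

def count_multislabs(num_slabs):
    """Number of contiguous runs of slabs: quadratic in the number of slabs."""
    return num_slabs * (num_slabs + 1) // 2

print(place_interval([10, 20, 30, 40], 12, 37))
```

The quadratic count is why a fan-out of $\Theta(\sqrt{B})$ yields $\Theta(B)$ multislabs per node.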
External Interval Tree
• Each leaf contains $O(B)$ intervals (unique endpoint assumption)
  – Stored in $O(1)$ blocks
• Slab lists implemented using B-trees
  – $O(1 + T_v/B)$ query
  – Linear space
    * We may “waste” a block for each of the $\Theta(\sqrt{B})$ lists in a node
    * But only $\Theta(\frac{N}{B\sqrt{B}})$ internal nodes
• Underflow structure implemented using static structure
  – $O(\log_B B^2 + T_v/B) = O(1 + T_v/B)$ query
  – Linear space
⇒ Linear space

External Interval Tree
• Query with x
  – Search down tree for x while in node v reporting all intervals in $I_v$ stabbed by x
• In node v
  – Query two slab lists
  – Report all intervals in relevant multislab lists
  – Query underflow structure
• Analysis:
  – Visit $O(\log_B N)$ nodes
  – Query slab lists, multislab lists and underflow structure: $O(1 + T_v/B)$ in each node
⇒ $O(\log_B N + T/B)$ query

External Interval Tree
• Update (assuming fixed endpoint set – static base tree), $O(\log_B N)$ I/Os in total:
  – Search for relevant node v
  – Update two slab lists
  – Update multislab list or underflow structure
• Update of underflow structure in $O(1)$ I/Os amortized
  – Maintain update block with B updates
  – Check of update block adds $O(1)$ I/Os to query bound
  – Rebuild structure when B updates have been collected using $O(\frac{B^2}{B} \log_B B^2) = O(B)$ I/Os (Global rebuilding)
⇒ Update in $O(\log_B N)$ I/Os amortized

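The update-block idea can be sketched in memory. This is an illustration under assumptions, not the underflow structure itself: a sorted list stands in for the static structure, the query just filters the current contents rather than achieving the stated I/O bound, and the class name `BufferedStatic` is hypothetical. What it shows is that updates are only logged until B of them have accumulated, at which point the static part is rebuilt once.

```python
class BufferedStatic:
    def __init__(self, B, items=()):
        self.B = B
        self.static = sorted(items)  # stand-in for the static structure
        self.pending = []            # the "update block": at most B logged updates
        self.rebuilds = 0

    def update(self, op, interval):
        """Log an update ('ins' or 'del'); rebuild once B updates are collected."""
        self.pending.append((op, interval))
        if len(self.pending) >= self.B:
            self._rebuild()

    def _current(self):
        # Static contents with the logged updates applied in order.
        current = set(self.static)
        for op, interval in self.pending:
            if op == "ins":
                current.add(interval)
            else:
                current.discard(interval)
        return current

    def _rebuild(self):
        # Global rebuilding: one rebuild pays for the B updates that triggered it.
        self.static = sorted(self._current())
        self.pending = []
        self.rebuilds += 1

    def stab(self, x):
        # A query consults both the static part and the update block.
        return sorted(iv for iv in self._current() if iv[0] <= x <= iv[1])

s = BufferedStatic(B=3, items=[(1, 5)])
s.update("ins", (2, 8))
s.update("ins", (6, 9))
print(s.stab(7), s.rebuilds)
```

With a rebuild cost of $O(B)$ I/Os spread over B updates, each update is charged $O(1)$ I/Os amortized, as stated on the slide.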
External Interval Tree
• Note:
  – Insert may increase number of intervals in underflow structure for same multislab to B
  – Delete may decrease number of intervals in multislab to B
⇒ Need to move B intervals to/from multislab/underflow structure
• We only move
  – intervals from multislab list when decreasing to size B/2
  – intervals to multislab list when increasing to size B
⇒ $O(1)$ I/Os amortized used to move intervals

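The two different thresholds prevent repeated moves when the size hovers around B. A minimal sketch of that rule for a single multislab, assuming a hypothetical `MultislabBucket` class that only tracks where the intervals currently live and how many bulk moves have happened:

```python
class MultislabBucket:
    def __init__(self, B):
        self.B = B
        self.items = set()
        self.in_list = False  # False: intervals live in the underflow structure
        self.moves = 0        # number of bulk moves performed so far

    def insert(self, interval):
        self.items.add(interval)
        # Move to a multislab list only when the size grows to B.
        if not self.in_list and len(self.items) >= self.B:
            self.in_list = True
            self.moves += 1

    def delete(self, interval):
        self.items.discard(interval)
        # Move back to the underflow structure only when the size drops to B/2.
        if self.in_list and len(self.items) <= self.B // 2:
            self.in_list = False
            self.moves += 1

bucket = MultislabBucket(B=8)
for i in range(8):
    bucket.insert((i, i + 100))
for _ in range(10):  # oscillate around size B without triggering moves
    bucket.delete((7, 107))
    bucket.insert((7, 107))
print(bucket.in_list, bucket.moves)
```

Between two consecutive moves at least B/2 updates must occur, so the cost of moving B intervals is spread over that many updates.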
Removing Fixed Endpoint Assumption
• We need to use dynamic base tree
  – Natural choice is B-tree
• Insertion:
  – Insert new endpoints and rebalance base tree (using splits)
  – Insert interval as previously in $O(\log_B N)$ I/Os amortized
• Split: Boundary in v becomes boundary in parent(v)
  [Figure: node v split into v’ and v’’]