External Memory Geometric Data Structures Lars Arge Duke University June 27, 2002 Summer School on Massive Datasets
External memory data structures
External Memory Geometric Data Structures
• Many massive dataset applications involve geometric data (or data that can be interpreted geometrically)
– Points, lines, polygons
• Data need to be stored in data structures on external storage media such that on-line queries can be answered I/O-efficiently
• Data often need to be maintained during dynamic updates
• Examples:
– Phone: Wireless tracking
– Consumer: Buying patterns (supermarket checkout)
– Geography: NASA satellites generate 1.2 TB per day
Example: LIDAR terrain data
• Massive (irregular) point sets (1-10m resolution)
• Appalachian Mountains (between 50GB and 5TB)
• Need to be queried and updated efficiently
Example: Jockey's Ridge (NC coast)
Model
• Model as previously
– N: Elements in structure
– B: Elements per block
– M: Elements in main memory
– T: Output size in searching problems
• Focus on
– Worst-case structures
– Dynamic structures
– Fundamental structures
– Fundamental design techniques
[Figure: disk D connected by block I/O to main memory M and processor P]
Outline
• Today: Dimension one
– External search trees: B-trees
– Techniques/tools
* Persistent B-trees (search in the past)
* Buffer trees (efficient construction)
• Tomorrow: "Dimension 1.5"
– Handling intervals/segments (interval stabbing/point location)
– Techniques/tools: Logarithmic method, weight-balanced B-trees, global rebuilding
• Saturday: Dimension two
– Two-dimensional range searching
External Search Trees
• Binary search tree:
– Standard method for search among N elements
– We assume elements in leaves
– Search traces at least one root-leaf path
– If nodes stored arbitrarily on disk
⇒ Search in $O(\log_2 N)$ I/Os
⇒ Rangesearch in $O(\log_2 N + T)$ I/Os
[Figure: binary search tree of height $O(\log_2 N)$]
External Search Trees
• BFS blocking:
– Block height $O(\log_2 N) / O(\log_2 B) = O(\log_B N)$
– Output elements blocked
⇒ Rangesearch in $O(\log_B N + T/B)$ I/Os
• Optimal: $O(N/B)$ space and $O(\log_B N + T/B)$ query
[Figure: binary tree cut into blocks, each holding a subtree of $\Theta(B)$ nodes and height $O(\log_2 B)$]
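The effect of BFS blocking can be checked with a little arithmetic. The sketch below is only illustrative: the function names are made up for this example, constants hidden by the O-notation are ignored, and the tree is assumed to be perfectly balanced.

```python
import math

def binary_tree_search_ios(n):
    """I/Os for one search when every node on the root-leaf path
    sits in a different block: about log2(n)."""
    return math.ceil(math.log2(n))

def blocked_search_ios(n, b):
    """I/Os for one search under BFS blocking: each block covers a subtree
    of height about log2(b), so the path crosses about log2(n)/log2(b) blocks."""
    return math.ceil(math.log2(n) / math.log2(b))

def blocked_range_ios(n, b, t):
    """Search cost plus the cost of scanning t reported elements
    that are packed b per block."""
    return blocked_search_ios(n, b) + math.ceil(t / b)
```

For example, with N = 2^30 elements and B = 2^10 elements per block, the unblocked search follows 30 nodes while the blocked search crosses 3 blocks.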
External Search Trees
• Maintaining BFS blocking during updates?
– Balance normally maintained in search trees using rotations
• Seems very difficult to maintain BFS blocking during rotation
– Also need to make sure output (leaves) is blocked!
[Figure: rotation exchanging nodes x and y]
B-trees
• BFS-blocking naturally corresponds to tree with fan-out $\Theta(B)$
• B-trees balanced by allowing node degree to vary
– Rebalancing performed by splitting and merging nodes
(a,b)-tree
• T is an (a,b)-tree ($a \ge 2$ and $b \ge 2a-1$)
– All leaves on the same level (contain between a and b elements)
– Except for the root, all nodes have degree between a and b
– Root has degree between 2 and b
• (a,b)-tree uses linear space and has height $O(\log_a N)$
⇒ Choosing $a, b = \Theta(B)$, each node/leaf stored in one disk block
⇒ $O(N/B)$ space and $O(\log_B N + T/B)$ query
[Figure: example (2,4)-tree]
(a,b)-Tree Insert
• Insert:
Search and insert element in leaf v
WHILE v has b+1 elements DO
  Split v:
    make nodes v' and v'' with $\lceil (b+1)/2 \rceil \le b$ and $\lfloor (b+1)/2 \rfloor \ge a$ elements
    insert element (ref) in parent(v)
    (make new root if necessary)
  v=parent(v)
• Insert touches $O(\log_a N)$ nodes
[Figure: node v with b+1 elements split into v' and v'']
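The insert procedure can be sketched in a few lines of in-memory Python. This is a minimal illustration under stated assumptions, not the slides' implementation: the names `Node`, `ABTree`, and `leaves` are invented here, keys are assumed distinct, nodes live in memory rather than in disk blocks, and deletion is not handled.

```python
from bisect import bisect_right, insort

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # leaf: sorted elements; internal: routing keys
        self.children = children  # None marks a leaf

    def size(self):
        return len(self.keys) if self.children is None else len(self.children)

class ABTree:
    def __init__(self, a, b):
        assert a >= 2 and b >= 2 * a - 1
        self.a, self.b = a, b
        self.root = Node([])      # a root leaf may hold fewer than a elements

    def contains(self, key):
        v = self.root
        while v.children is not None:
            v = v.children[bisect_right(v.keys, key)]
        return key in v.keys

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:     # the root itself was split: make a new root
            sep, right = split
            self.root = Node([sep], [self.root, right])

    def _insert(self, v, key):
        if v.children is None:
            insort(v.keys, key)
        else:
            i = bisect_right(v.keys, key)
            split = self._insert(v.children[i], key)
            if split is not None:  # insert the new reference in the parent
                sep, right = split
                v.keys.insert(i, sep)
                v.children.insert(i + 1, right)
        return self._split(v) if v.size() > self.b else None

    def _split(self, v):
        """Split a node of size b+1 into halves of floor and ceil of (b+1)/2."""
        if v.children is None:
            mid = len(v.keys) // 2
            right = Node(v.keys[mid:])
            v.keys = v.keys[:mid]
            return right.keys[0], right
        mid = len(v.children) // 2
        sep = v.keys[mid - 1]     # routing key that moves up to the parent
        right = Node(v.keys[mid:], v.children[mid:])
        v.keys, v.children = v.keys[:mid - 1], v.children[:mid]
        return sep, right

def leaves(v, depth=0):
    """Yield (depth, elements) for every leaf, left to right."""
    if v.children is None:
        yield depth, v.keys
    else:
        for c in v.children:
            yield from leaves(c, depth + 1)
```

The recursion mirrors the loop on the slide: a split at one level may push a reference into the parent, which may overflow in turn, so at most one node per level is touched on the way back up.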
(a,b)-Tree Insert
[Figure: (a,b)-tree insert example]
(a,b)-Tree Delete
• Delete:
Search and delete element from leaf v
WHILE v has a-1 children DO
  Fuse v with sibling v':
    move children of v' to v
    delete element (ref) from parent(v)
    (delete root if necessary)
  If v has >b (and ≤ a+b-1) children split v
  v=parent(v)
• Delete touches $O(\log_a N)$ nodes
[Figure: node v with a-1 children fused with its sibling, giving a node with ≥ 2a-1 children]
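The fuse-then-split step always leaves nodes of legal degree, which can be checked on child counts alone. The helper below is a hypothetical sketch (the name `fuse_then_maybe_split` is not from the slides) that tracks only the number of children, not the children themselves.

```python
def fuse_then_maybe_split(sibling_children, a, b):
    """Degrees of the node(s) left after a node with a-1 children is fused
    with a sibling that has sibling_children children (between a and b)."""
    assert a <= sibling_children <= b
    fused = (a - 1) + sibling_children   # between 2a-1 and a+b-1
    if fused > b:                        # too large: split into two halves
        return [fused // 2, fused - fused // 2]
    return [fused]
```

Because b ≥ 2a-1, a fused node that exceeds b has at least 2a children, so both halves have at least a; a fused node that fits has at least 2a-1 ≥ a children.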
(a,b)-Tree Delete
[Figure: (a,b)-tree delete example]
(a,b)-Tree
• (a,b)-tree properties:
– If b=2a-1 one update can cause many rebalancing operations
– If b≥2a an update only causes O(1) rebalancing operations amortized
– If b>2a $O\left(\frac{1}{b/2 - a}\right) = O\left(\frac{1}{a}\right)$ rebalancing operations amortized
* Both somewhat hard to show
– If b=4a easy to show that update causes $O\left(\frac{1}{a}\log_a N\right)$ rebalance operations amortized
* After split during insert a leaf contains ≅ 4a/2=2a elements
* After fuse (and possible split) during delete a leaf contains between ≅ 2a and ≅ $\frac{5}{2}a$ elements
[Figure: (2,3)-tree in which alternating insert and delete repeatedly trigger rebalancing]
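The leaf-level part of the b=4a argument comes down to two subtractions. The following is a sketch of that arithmetic only, taking the approximate occupancy bounds from the slide at face value and taking overflow at b+1 = 4a+1 elements and underflow at a-1 elements.

```latex
\begin{align*}
\text{inserts needed to overflow a rebalanced leaf} &\ge (4a + 1) - \tfrac{5}{2}a = \tfrac{3}{2}a + 1,\\
\text{deletes needed to underflow a rebalanced leaf} &\ge 2a - (a - 1) = a + 1.
\end{align*}
```

So roughly a updates must reach a leaf between two consecutive rebalancing operations on it, which is where an $O(1/a)$ amortized bound per leaf comes from.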
(a,b)-Tree
• (a,b)-tree with leaf parameters $a_l, b_l$ ($b=4a$ and $b_l=4a_l$)
– Height $O\left(\log_a \frac{N}{a_l}\right)$
– $O\left(\frac{1}{a_l}\right)$ amortized leaf rebalance operations
– $O\left(\frac{1}{a \cdot a_l}\log_a N\right)$ amortized internal node rebalance operations
• B-trees: (a,b)-trees with $a, b = \Theta(B)$
– B-trees with elements in the leaves sometimes called B+-tree
• Fan-out k B-tree:
– (k/4, k)-trees with leaf parameter $\Theta(B)$ and elements in leaves
• Fan-out $\Theta(B^{1/c})$ B-tree with $c \ge 1$
– $O(N/B)$ space
– $O(\log_{B^{1/c}} N + T/B) = O(\log_B N + T/B)$ query
– $O(\log_B N)$ update
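The equality of the two query bounds for fan-out $\Theta(B^{1/c})$ follows from a change of logarithm base, assuming c is a constant:

```latex
\[
\log_{B^{1/c}} N
  = \frac{\log N}{\log B^{1/c}}
  = \frac{\log N}{\tfrac{1}{c}\log B}
  = c \log_B N
  = O(\log_B N)
  \quad \text{for constant } c \ge 1 .
\]
```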
Persistent B-tree
• In some applications we are interested in being able to access previous versions of data structure
– Databases
– Geometric data structures (later)
• Partial persistence:
– Update current version (getting new version)
– Query all versions
• We would like to have partial persistent B-tree with
– $O(N/B)$ space (N is number of updates performed)
– $O(\log_B N)$ update
– $O(\log_B N + T/B)$ query in any version
Persistent B-tree
• Easy way to make B-tree partial persistent
– Copy structure at each operation
– Maintain "version-access" structure (B-tree)
• Good $O(\log_B N + T/B)$ query in any version, but
– $O(N/B)$ I/O update
– $O(N^2/B)$ space
[Figure: a full copy of the structure for each of the versions i, i+1, i+2, i+3, each created by one update]
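The quadratic space of the copy-everything approach is easy to see by counting blocks. The sketch below assumes, purely for illustration, a sequence of insertions only (so version i holds i elements) and counts just the leaf blocks of each copy; `naive_copy_blocks` is a made-up name.

```python
def naive_copy_blocks(num_inserts, b):
    """Total leaf blocks when every version i = 1..num_inserts is kept as a
    separate full copy holding i elements packed b per block."""
    return sum(-(-i // b) for i in range(1, num_inserts + 1))  # ceil(i / b)
```

With N = 1000 insertions and B = 100 this gives 5500 blocks, close to N²/(2B) = 5000, whereas a linear-space structure would need on the order of N/B = 10 blocks.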
Persistent B-tree
• Idea:
– Elements augmented with "existence interval"
– Augmented elements stored in one structure
– Elements "alive" at "time" t (version t) form B-tree
– Version access structure (B-tree) to access B-tree root at time t
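The existence-interval idea can be illustrated without any tree at all. The sketch below keeps augmented elements in a flat list, which is an assumption made only to keep the example short: it shows how one structure answers queries about any version, but has none of the I/O bounds of the persistent B-tree. The names `Versioned` and `VersionedSet` are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Versioned:
    key: int
    born: int                 # version in which the element was inserted
    died: float = math.inf    # version in which it was deleted, if any

class VersionedSet:
    def __init__(self):
        self.now = 0          # current version number
        self.items = []       # all augmented elements, alive or dead

    def insert(self, key):
        self.now += 1
        self.items.append(Versioned(key, self.now))

    def delete(self, key):
        for e in self.items:
            if e.key == key and e.died == math.inf:
                self.now += 1
                e.died = self.now   # close the existence interval
                return
        raise KeyError(key)

    def alive_at(self, t):
        """Elements whose existence interval [born, died) contains version t."""
        return sorted(e.key for e in self.items if e.born <= t < e.died)
```

Deleting never removes anything; it only closes an interval, which is why old versions stay queryable.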
Persistent B-tree
• Directed acyclic graph with elements in leaves (sinks)
– Routing elements in internal nodes
• Each element (routing element) and node has existence interval
• Nodes alive at time t make up (B/4, B)-tree on alive elements
• B-tree on all roots (version access structure)
⇒ Answer query at version t in $O(\log_B N + T/B)$ I/Os as in normal B-tree
• Additional invariant:
– New node (only) contains between $\frac{3}{8}B$ and $\frac{7}{8}B$ live elements
⇒ $O(N/B)$ blocks
[Figure: occupancy scale marked at $\frac{1}{4}B$, $\frac{3}{8}B$, $\frac{7}{8}B$ and $B$, with gaps of $\frac{1}{8}B$, $\frac{1}{2}B$ and $\frac{1}{8}B$ between the marks]
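The step from the additional invariant to $O(N/B)$ blocks can be sketched as a charging argument for leaves, using only the thresholds listed on the slide (block overflow above $B$ elements, block underflow below $\frac{1}{4}B$ live elements):

```latex
\begin{align*}
\text{inserts before a new node can overflow} &> B - \tfrac{7}{8}B = \tfrac{1}{8}B,\\
\text{deletes before a new node can underflow} &> \tfrac{3}{8}B - \tfrac{1}{4}B = \tfrac{1}{8}B.
\end{align*}
```

Under this reading every newly created leaf absorbs more than $\frac{1}{8}B$ updates before it triggers the creation of further nodes, so N updates create $O(N/B)$ leaves.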
Persistent B-tree Insert
• Search for relevant leaf l and insert new element
• If l now contains more than B elements: Block overflow
– Version split: Mark l dead and create new node v with the x alive elements of l
– If $x > \frac{7}{8}B$: Strong overflow
– If $x < \frac{3}{8}B$: Strong underflow
– If $\frac{3}{8}B \le x \le \frac{7}{8}B$ then recursively update parent(l): Delete reference to l and insert reference to v
[Figure: occupancy scale marked at $\frac{1}{4}B$, $\frac{3}{8}B$, $\frac{7}{8}B$ and $B$, before and after the version split]
Persistent B-tree Insert
• Strong overflow ($x > \frac{7}{8}B$)
– Split v into v' and v'' with $\frac{x}{2}$ elements each ($\frac{3}{8}B < \frac{x}{2} \le \frac{1}{2}B$)
– Recursively update parent(l): Delete reference to l and insert reference to v' and v''
• Strong underflow ($x < \frac{3}{8}B$)
– Merge x elements with y live elements obtained by version split on sibling ($x + y \ge \frac{1}{2}B$)
– If $x + y \ge \frac{7}{8}B$ then (strong overflow) perform split
– Recursively update parent(l): Delete two references, insert one or two references
[Figure: occupancy scales marked at $\frac{1}{4}B$, $\frac{3}{8}B$, $\frac{7}{8}B$ and $B$ for the split and merge cases]
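The case analysis after a version split can be written down on live-element counts alone. The helper below is a hypothetical sketch (`new_node_sizes` is not a name from the slides); it tracks only how many live elements each new node receives and uses integer comparisons so that B need not be a multiple of 8.

```python
def new_node_sizes(x, B, sibling_live=None):
    """Live-element counts of the node(s) created after a version split
    leaves x live elements; sibling_live is y, the sibling's live count."""
    if 8 * x > 7 * B:                     # strong overflow: split in two
        return [x // 2, x - x // 2]
    if 8 * x < 3 * B:                     # strong underflow: merge with sibling
        if sibling_live is None:
            raise ValueError("strong underflow needs the sibling's live count")
        total = x + sibling_live
        if 8 * total >= 7 * B:            # merged node too full: split it
            return [total // 2, total - total // 2]
        return [total]
    return [x]                            # already within [3B/8, 7B/8]
```

For B = 64 the invariant range is 24 to 56 live elements: 60 live elements split into two nodes of 30, and 20 live elements merged with a sibling holding 16 give one node of 36.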
Persistent B-tree Delete
• Search for relevant leaf l and mark element dead
• If l contains $x < \frac{1}{4}B$ alive elements: Block underflow
– Version split: Mark l dead and create new node v with x alive elements
– Strong underflow ($x < \frac{3}{8}B$): Merge (version split) and possibly split (strong overflow)
– Recursively update parent(l): Delete two references, insert one or two references
[Figure: occupancy scale marked at $\frac{1}{4}B$, $\frac{3}{8}B$, $\frac{7}{8}B$ and $B$, with gaps of $\frac{1}{8}B$, $\frac{1}{2}B$ and $\frac{1}{8}B$]