Spatial Indexing Ramakrishnan/Gehrke Ch. 28 340151 Big Data & Cloud Services (P. Baumann) 1
Applications of Multidimensional Data
Geographic Information Systems (GIS)
• Geospatial information; service standards by Open Geospatial Consortium (OGC)
• Vendors: ESRI, Intergraph, SmallWorld, …, Oracle, …; open-source: GRASS, PostGIS, …
• All classes of spatial queries and data are common
Computer-Aided Design / Manufacturing
• Spatial objects, ex: surface of airplane fuselage
• Range queries and spatial join queries are common
Multimedia Databases
• Images, video, text, etc. stored and retrieved by content
• First converted to feature vector form; high dimensionality
• Nearest-neighbor queries are the most common
Multidimensional Data
Point Data
• = points in a multidimensional space
• Ex: geographic locations; feature vectors extracted from text
Region Data
• = objects having spatial extent with location and boundary
• DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data
What about raster data such as satellite imagery?
• = each pixel stores a measured value
• For this chapter we assume vector data; raster data and their operations have specific, rather distinct characteristics
Multidimensional Queries
Point queries
• "Show Bremen"
Spatial Range queries
• "Find all hotels within a radius of 5 miles from the conference venue"
• "Find all cities that lie on the Nile in Egypt"
• 50 < age < 55 AND 80K < sal < 90K
• 50 < Lat < 55 AND 80 < Long < 90
Nearest-Neighbor queries
• "Find the 10 cities nearest to Bremen"
• "Find the city with population 500,000 or more that is nearest to Kalamazoo, MI"
Spatial Join queries
• "Find all cities near a lake"
• "Find all parts that touch the fuselage" (in airplane design)
• Expensive; join condition involves regions and proximity!
Similarity queries
• "Given a face, find the five most similar faces"
…plus aggregation, and more
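To make the range and nearest-neighbor query types concrete, here is a minimal brute-force sketch that scans every point. The point names, coordinates, and helper functions are made up for illustration, and distance is plain Euclidean distance in the plane (real geographic queries would need a proper geodetic distance).

```python
import heapq

# Hypothetical 2-D point data: name -> (x, y).
points = {"p1": (1, 1), "p2": (2, 5), "p3": (6, 2), "p4": (5, 5), "p5": (9, 9)}

def range_query(data, lo, hi):
    """Names of points with lo[i] < coordinate[i] < hi[i] in every dimension."""
    return [name for name, p in data.items()
            if all(l < c < h for l, c, h in zip(lo, p, hi))]

def nearest(data, q, k):
    """Names of the k points closest to q (squared Euclidean distance)."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(data[name], q))
    return heapq.nsmallest(k, data, key=dist2)

in_box = range_query(points, lo=(0, 0), hi=(7, 3))
two_nearest = nearest(points, q=(5, 4), k=2)
print(in_box, two_nearest)
```

Both functions touch every point, which is the cost a multidimensional index tries to avoid.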
First Try: Composite B+ Tree
With Emp relation:
• Sort entries first by age and then by sal
• Ex: index on <age, sal>
Observation: Composite search key B+ tree linearizes 2-D space
Problems:
• Spatial proximity lost: "close in nature" should imply "close on disk"
• Not symmetric in dimensions
Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>
[Figure: the entries plotted with age (11 to 13) on the horizontal axis and sal (10 to 80) on the vertical axis, contrasting spatial clusters with B+ tree order]
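The loss of proximity can be checked directly on the four entries from the slide. This is a small sketch that uses sorted Python tuples as a stand-in for the leaf order of a composite <age, sal> B+ tree; the helper `euclid` is introduced here only for illustration.

```python
# Entries from the slide as (age, sal) pairs, in arbitrary insertion order.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# A composite <age, sal> key orders entries lexicographically:
# first by age, then by sal. Python tuples compare the same way.
btree_order = sorted(entries)

def euclid(p, q):
    """Euclidean distance between two 2-D points."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# Distance between entries that are adjacent in index order.
for a, b in zip(btree_order, btree_order[1:]):
    print(a, b, round(euclid(a, b), 1))

# Distance between the first and the last entry in index order.
print(btree_order[0], btree_order[-1], round(euclid(btree_order[0], btree_order[-1]), 1))
```

The entries <11, 80> and <13, 75> are only about 5.4 apart in the plane, yet they end up at opposite ends of the index order, while the index neighbors <11, 80> and <12, 10> are about 70 apart.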
Second Try: Multiple B+ Trees
Query example: select * from R where a0 < A < a1 and b0 < B < b1
Several conventional indexes:
- read tuples with a0 < A < a1
- read tuples with b0 < B < b1
- intersect
Wanted: read only tuples with a0 < A < a1 and b0 < B < b1
[Figure: two sketches over axes A and B with a0, a1, b0, b1 marked, contrasting the conventional-index approach with the wanted result]
Problems:
• Selects way too much data
• Index space grows with dimensionality
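The "selects way too much data" problem can be seen in a small simulation. This is a sketch under stated assumptions: the relation R is a made-up 10 x 10 grid of tuples, and each single-attribute index is modeled as a sorted list of (key, rid) pairs searched with `bisect`, not as a real B+ tree.

```python
from bisect import bisect_left, bisect_right

# Hypothetical relation R: rid -> (A, B), forming a 10 x 10 grid of values.
R = {rid: (rid % 10, rid // 10) for rid in range(100)}

# One conventional index per attribute, modeled as a sorted (key, rid) list.
index_a = sorted((a, rid) for rid, (a, b) in R.items())
index_b = sorted((b, rid) for rid, (a, b) in R.items())

def open_range(index, lo, hi):
    """Set of rids whose key satisfies lo < key < hi."""
    start = bisect_right(index, (lo, float("inf")))   # skip all keys <= lo
    end = bisect_left(index, (hi, float("-inf")))     # stop before keys >= hi
    return {rid for _, rid in index[start:end]}

a0, a1, b0, b1 = 2, 5, 6, 9
from_a = open_range(index_a, a0, a1)   # strip a0 < A < a1
from_b = open_range(index_b, b0, b1)   # strip b0 < B < b1
result = from_a & from_b               # intersect the two rid sets

print(len(from_a), len(from_b), len(result))
```

In this toy setup the two index scans fetch 20 rids each, but only 4 of them survive the intersection.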
Wanted: a Multi-Dimensional Index
Requirements:
• Any number of dimensions
• Symmetric behavior for all dimensions
• Supports inserts and deletes gracefully
• Ideally, want to support non-point data as well (e.g., lines, shapes)
Zillions of approaches and variants in literature
• Grid file, Quad/Oct-tree, kdb-tree, space-filling curves, …
• Core idea always: spatial clustering of entries on disk
We look into the R-tree
• Widely used, in many variants
The R-Tree
R-tree = tree-structured n-D index [Guttman 1984]
• Discriminating value of B+-Tree substituted by bounding intervals (bbox)
• Index search by bbox, not by exact (polygon) shape
Leaf entry = < n-dimensional box, rid >
• Tightest bounding box for object
Non-leaf entry = < n-dim box, ptr to child node >
• Box covers all boxes in child node (in fact, subtree)
[Figure: 2-D sketch over axes X and Y, from the root of the R-tree down to the leaf level]
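The two kinds of boxes can be sketched in a few lines. The `Box` type, the helper names, and the sample shapes below are assumptions made for illustration; they are not taken from any particular R-tree implementation.

```python
from typing import NamedTuple, Tuple

class Box(NamedTuple):
    lo: Tuple[float, ...]   # lower corner, one value per dimension
    hi: Tuple[float, ...]   # upper corner, one value per dimension

def bbox_of_points(points):
    """Tightest axis-aligned box around an object given by its vertices."""
    dims = list(zip(*points))
    return Box(tuple(min(d) for d in dims), tuple(max(d) for d in dims))

def cover(boxes):
    """Smallest box covering all given boxes (the box of a non-leaf entry)."""
    los = list(zip(*(b.lo for b in boxes)))
    his = list(zip(*(b.hi for b in boxes)))
    return Box(tuple(min(d) for d in los), tuple(max(d) for d in his))

# Two hypothetical spatial objects given as vertex lists.
triangle = [(1.0, 1.0), (4.0, 2.0), (2.0, 5.0)]
square = [(6.0, 0.0), (8.0, 0.0), (8.0, 2.0), (6.0, 2.0)]

# Leaf entries: <box, rid>. The rids are placeholder strings.
leaf_entries = [(bbox_of_points(triangle), "rid-1"),
                (bbox_of_points(square), "rid-2")]

# The non-leaf entry pointing at this leaf carries a box covering both.
parent_box = cover([box for box, _ in leaf_entries])
print(parent_box)
```

The same functions work unchanged for any number of dimensions, since they only iterate over coordinate tuples.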
Sample R-Tree
[Figure: 2-D layout of the boxes R1 to R19; legend: leaf entry, index entry, spatial object]
Sample R-Tree (contd.)
[Figure: tree view of the sample R-tree with levels R1, R2; R3 to R7; R8 to R19; edges denote "contains"]
Sample 3D R+-Tree [Wikipedia]
Search for Objects Overlapping Box Q
Current node := root;
1. If current node is non-leaf:
   for each entry <E, ptr>: if box E overlaps Q then search subtree identified by ptr;
2. If current node is leaf:
   for each entry <E, rid>: if box E overlaps Q then rid identifies an object that might overlap Q.
May have to search several subtrees at each node! (B-tree equality search goes to just one leaf)
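The two search steps translate almost directly into a recursive function. This is a minimal in-memory sketch: the `Node` class, the (lo, hi) tuple representation of boxes, and the hand-built two-level tree are assumptions for illustration, with no paging or insertion logic.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool
    # Leaf: list of (box, rid). Non-leaf: list of (box, child Node).
    entries: list = field(default_factory=list)

def overlaps(a, b):
    """True if boxes a and b, each given as (lo, hi), intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def search(node, q):
    """Collect rids whose bounding box overlaps query box q."""
    hits = []
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                hits.append(ref)               # candidate; exact shape not checked
            else:
                hits.extend(search(ref, q))    # descend into every overlapping subtree
    return hits

# Hand-built example tree with two leaves.
leaf1 = Node(True, [(((0, 0), (2, 2)), "a"), (((1, 1), (3, 3)), "b")])
leaf2 = Node(True, [(((6, 6), (8, 8)), "c"), (((7, 0), (9, 2)), "d")])
root = Node(False, [(((0, 0), (3, 3)), leaf1), (((6, 0), (9, 8)), leaf2)])

found = search(root, ((2.5, 2.5), (6.5, 6.5)))
print(found)
```

In this example the query box overlaps both root entries, so both subtrees are visited, which is exactly the situation the slide warns about.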
R-Tree Properties
Balanced: all leaves at same distance from root
• Remains balanced on inserts and deletes
Nodes can be kept 50% full (except root)
• Can choose parameter m <= 50%, and ensure that every node is at least m% full
Inexact match: search by object bounding box, not object
• Needs (usually inexpensive) exact match step afterwards
• Common for all multidimensional access methods
Generally good behavior in practice, however not necessarily good worst-case performance
• Priority R-Tree: as efficient as R-Tree + worst-case optimal [Arge, de Berg, Haverkort, Yi 2004]
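The inexact-match property implies a two-phase evaluation: filter by bounding box, then refine with the exact geometry. The sketch below assumes circles as the exact shapes only because their exact test against a box is short; the slide speaks of polygons, and all names and values here are illustrative.

```python
def overlaps(a, b):
    """True if boxes a and b, each given as (lo, hi), intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def circle_bbox(circle):
    """Tightest bounding box of a circle given as ((cx, cy), r)."""
    (cx, cy), r = circle
    return ((cx - r, cy - r), (cx + r, cy + r))

def circle_hits_box(circle, box):
    """Exact test: does the circle intersect the axis-aligned box?"""
    (cx, cy), r = circle
    (xlo, ylo), (xhi, yhi) = box
    nx = min(max(cx, xlo), xhi)   # point of the box nearest to the center
    ny = min(max(cy, ylo), yhi)
    return (nx - cx) ** 2 + (ny - cy) ** 2 <= r * r

# Hypothetical objects: rid -> circle.
circles = {"c1": ((0.0, 0.0), 1.0), "c2": ((1.5, 1.5), 0.2), "c3": ((5.0, 5.0), 1.0)}
q = ((0.8, 0.8), (2.0, 2.0))

# Filter step: what the index can answer using bounding boxes alone.
candidates = [rid for rid, c in circles.items() if overlaps(circle_bbox(c), q)]
# Refine step: exact geometry test on the candidates only.
results = [rid for rid in candidates if circle_hits_box(circles[rid], q)]
print(candidates, results)
```

Here c1 is a false positive: its bounding box touches the query box near the corner, but the circle itself does not reach it, so the refine step drops it.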
R-Tree Variants
R+ tree: [Sellis, Roussopoulos, Faloutsos 1987] avoid overlap by inserting object into multiple leaves if necessary
• Single path to leaf
• …at cost of redundancy
R* tree: [Beckmann, Kriegel, Schneider, Seeger 1990] forced re-inserts to reduce overlap in tree nodes
• When node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree
  - Could result in all reinserted entries fitting on some existing pages, avoiding a split
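The slide does not say which entries to remove on overflow. The sketch below covers only the selection step, and it assumes one plausible rule: remove the entries whose box centers lie farthest from the center of the node's covering box. The function names, the 30% default, and the sample node are illustrative; the reinsertion itself is not shown.

```python
def center(box):
    """Center point of a box given as (lo, hi)."""
    lo, hi = box
    return tuple((l + h) / 2 for l, h in zip(lo, hi))

def cover(boxes):
    """Smallest box covering all given (lo, hi) boxes."""
    los = list(zip(*(lo for lo, _ in boxes)))
    his = list(zip(*(hi for _, hi in boxes)))
    return (tuple(min(d) for d in los), tuple(max(d) for d in his))

def pick_for_reinsert(entries, fraction=0.3):
    """Split (box, rid) entries of an overflowing node into (removed, kept).

    Assumed rule: remove the entries farthest from the node's center.
    """
    node_center = center(cover([box for box, _ in entries]))

    def dist2(entry):
        c = center(entry[0])
        return sum((a - b) ** 2 for a, b in zip(c, node_center))

    ranked = sorted(entries, key=dist2, reverse=True)
    k = max(1, int(len(entries) * fraction))
    return ranked[:k], ranked[k:]

# Hypothetical overflowing node: six nearby unit boxes plus one far-away box.
node = [(((0, 0), (1, 1)), "a"), (((1, 0), (2, 1)), "b"), (((0, 1), (1, 2)), "c"),
        (((1, 1), (2, 2)), "d"), (((2, 0), (3, 1)), "f"), (((2, 1), (3, 2)), "g"),
        (((7, 6), (8, 7)), "e")]

removed, kept = pick_for_reinsert(node)
print([rid for _, rid in removed], cover([box for box, _ in kept]))
```

After removal, the remaining entries fit in a much smaller covering box, which is the kind of tightening that forced reinsertion aims for.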
Summary
Index support for multidimensional queries has many applications
• GIS, CAD/CAM, …: spatio-temporal, 2..4-D
• Multimedia indexing, statistical databases: non-spatial dimensions, 3-D..12-D..10,000-D…
Fundamental difference between space/time and feature spaces
• <4D vs 1000s of dimensions
• R-tree worse than sequential scan for 12+ D
Main multidimensional data and query types:
• Point and region data
• Overlap/containment and nearest-neighbor queries
Summary (contd.)
Many approaches to indexing; R-tree approach widely used in GIS
• Overall, works quite well for 2..4-D datasets
• Several variants (notably, R+ and R* trees) proposed; widely used
Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection
Issues
• For high-dimensional datasets, unless data has good "contrast", nearest-neighbor may not be well-separated
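The "contrast" issue can be illustrated with a small experiment. This sketch assumes uniformly random points in the unit cube and measures contrast as the ratio of farthest to nearest distance from a random query point; the function name, sample size, and seed are arbitrary choices, and real feature data may behave differently.

```python
import math
import random

def contrast(dim, n_points=200, seed=0):
    """Ratio of farthest to nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

low_dim = contrast(2)
high_dim = contrast(1000)
print(f"2-D contrast: {low_dim:.1f}, 1000-D contrast: {high_dim:.2f}")
```

In 2-D the nearest point is many times closer than the farthest one, while in 1000-D all distances cluster around the same value, so "nearest" carries little information.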