Big Spatial Data Management on Hadoop and Beyond
Ahmed Eldawy
Computer Science and Engineering
11/30/18

Claudius Ptolemy (AD 90 – AD 168)
Al Idrisi (1099 – 1165)
Cholera cases in the London epidemic of 1854
"I have BIG data. I need HELP..!!"
"Cool computer technology..!! Can I use it in my application?"
"My pleasure. Here it is."
"Oh..!! But, it is not made for me. Can't make use of it as is."

1969
"Kindly let me understand your needs."
"Kindly let me get the technology you have."

"HELP..!! I have BIG data. Your technology is not helping me."
"mmm… Let me check with my good friends there."
"Cool Database technology..!! Can I use it in my application?"
"My pleasure. Here it is."
"Oh..!! But, it is not made for me. Can't make use of it as is."

"Kindly let me understand your needs."
"Kindly let me get the technology you have."

"HELP..!! Again, I have BIG data. Your technology is not helping me."
"Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there."
"Cool Big Data technology..!! Can I use it in my application?"
"My pleasure. Here it is."
"Oh..!! But, it's not made for me. Can't make use of it as is."

"Kindly let me understand your needs."
"Kindly let me get the technology you have."

Big Spatial Data Management
Tons of Spatial data out there…
- Geotagged Pictures
- Geotagged Microblogs
- Medical Data
- Sensor Networks
- Smart Phones
- Satellite Images
- Traffic Data
- VGI

Spatial Data & Hadoop ➔ SpatialHadoop

Hadoop:
    points = LOAD 'points' AS (id:int, x:int, y:int);
    result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;
Takes 193 seconds

SpatialHadoop:
    points = LOAD 'points' AS (id:int, location:point);
    result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));
Finishes in 2 seconds

- Components: Spatial Operations, Visualization, Spatial Language, Spatial Indexes
- 80,000 downloads in one year
- Conducted more than seven keynotes, tutorials, and invited talks
- Industry, Academia, Students Projects, Collaboration
- >500GB public datasets for benchmarking and testing
- Incubated by Eclipse Foundation and renamed to GeoJinni
- University of Genova

The Built-in Approach of SpatialHadoop
[Figure: three architectures side by side: the on-top approach, the from-scratch approach, and the built-in approach (SpatialHadoop). In the built-in approach, each Hadoop layer is extended with a spatial counterpart: Pig Latin + Spatial Language; Hadoop Java APIs + Spatial Operators; Job Monitoring and Scheduling; MapReduce Runtime + Early Pruning; Storage (HDFS) + Spatial Indexing.]

Agenda
- The ecosystem of SpatialHadoop
- Motivation
- Internal system design
- Applications
- Related work
- Experiments
- Open Research Problems

SpatialHadoop Architecture
- Applications: SHAHED [ICDE'15], MNTG [SSTD'13, ICDE'14], TAREEG [SIGMOD'14, SIGSPATIAL'14]
- Language: Pigeon [ICDE'14]
- Visualization: [VLDB'15, ICDE'16]
- ST-Hadoop
- Operations: Basic operations, CG_Hadoop [SIGSPATIAL'13]
- MapReduce: Spatial File Splitter, Spatial Record Reader
- Indexing: Grid, R-tree, R+-tree, Quad tree [VLDB'15]
- Also cited on the slide: VLDB'13, ICDE'15

Indexing
Data Loading in Hadoop
- Blindly chops a big input file into 128MB chunks that are spread over the data nodes
- Values of records are not considered
- Relevant records are typically assigned to two different blocks
- HDFS is too restrictive: files cannot be modified

Two-layer Index Layout
- Global index: global indexing across the data nodes
- Locally indexed HDFS blocks

Uniform Grid
- Works only for uniformly distributed data

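
The limitation of a uniform grid can be illustrated with a small Python sketch. The function names, the 2 x 2 grid, and the two synthetic point sets are assumptions made for illustration and are not taken from SpatialHadoop.

```python
from collections import Counter

def grid_cell(x, y, mbr, cols, rows):
    """Return the (col, row) of the uniform grid cell that contains (x, y)."""
    xmin, ymin, xmax, ymax = mbr
    col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
    row = min(int((y - ymin) / (ymax - ymin) * rows), rows - 1)
    return col, row

def partition_sizes(points, mbr, cols, rows):
    """Count how many points fall into each grid cell (one cell = one partition)."""
    return Counter(grid_cell(x, y, mbr, cols, rows) for x, y in points)

mbr = (0.0, 0.0, 100.0, 100.0)
# Hypothetical data: 100 evenly spread points vs. 100 points clustered in one corner.
uniform = [(i * 10 + 5.0, j * 10 + 5.0) for i in range(10) for j in range(10)]
skewed = [(1.0 + i * 0.1, 1.0 + j * 0.1) for i in range(10) for j in range(10)]

uniform_sizes = partition_sizes(uniform, mbr, 2, 2)
skewed_sizes = partition_sizes(skewed, mbr, 2, 2)
```

With the evenly spread input every cell receives 25 points, while the clustered input puts all 100 points in a single cell, which is the load imbalance the slide warns about.
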
R-tree
- Read a sample
- Bulk load the sample into an R-tree
- Leaf node capacity: $C = \frac{k \cdot B}{|R| (1 + \alpha)}$
  k: Sample size, B: HDFS Block capacity, |R|: Input size, α: Index overhead
- Use MBR of leaf nodes as partition boundaries
- Partition the data
- Optional: Build R-tree Local indexes

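
The sample-based steps above can be sketched in Python. This is a minimal sketch under stated assumptions: the leaf capacity follows the formula C = k·B / (|R|(1 + α)), and the bulk loading is done with Sort-Tile-Recursive (STR) packing, which is one common bulk-loading method chosen here for illustration. The function names and the 4 x 4 sample are hypothetical.

```python
import math

def leaf_capacity(k, block_capacity, input_size, alpha):
    """C = k * B / (|R| * (1 + alpha)), rounded down, at least 1."""
    return max(1, math.floor(k * block_capacity / (input_size * (1 + alpha))))

def str_leaf_mbrs(sample, capacity):
    """Pack sample points into leaves with STR and return each leaf's MBR."""
    n_leaves = math.ceil(len(sample) / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))
    per_slice = n_slices * capacity          # points per vertical slice
    by_x = sorted(sample)                    # sort by x first
    mbrs = []
    for i in range(0, len(by_x), per_slice):
        # Within a slice, sort by y and cut into runs of `capacity` points.
        vslice = sorted(by_x[i:i + per_slice], key=lambda p: p[1])
        for j in range(0, len(vslice), capacity):
            leaf = vslice[j:j + capacity]
            xs = [p[0] for p in leaf]
            ys = [p[1] for p in leaf]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs

# Hypothetical numbers: 16 sample points, 128 MB blocks, 512 MB input, no overhead.
capacity = leaf_capacity(k=16, block_capacity=128, input_size=512, alpha=0)
sample = [(i, j) for i in range(4) for j in range(4)]
boundaries = str_leaf_mbrs(sample, capacity)
```

Here the 512 MB input needs four 128 MB blocks, so the 16 sample points are packed four per leaf and yield four partition boundaries. MBRs computed from a sample need not cover every input record, so a full system would also need a rule for records that fall outside all boundaries; that part is outside this sketch.
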
R-tree-based Index of a 400 GB road network
Non-indexed Heap File
MapReduce Layer
- Applications: SHAHED [ICDE'15], MNTG [SSTD'13, ICDE'14], TAREEG [SIGMOD'14, SIGSPATIAL'14]
- Language: Pigeon [ICDE'14]
- Visualization: [VLDB'15, ICDE'16]
- ST-Hadoop [TODS]
- Operations: Basic operations, CG_Hadoop [SIGSPATIAL'13, TSAS]
- MapReduce: Spatial File Splitter, Spatial Record Reader
- Indexing: Grid, R-tree, R+-tree, Quad tree [VLDB'15]
- Also cited on the slide: VLDB'13, ICDE'15
- Legend: Under review, Demo paper

Map plan – Hadoop vs. Map plan – SpatialHadoop
[Figure: in Hadoop, the input heap file(s) go through the File Splitter, which produces a number of splits; each split is read by a Record Reader that passes (k,v) pairs to a Map task. In SpatialHadoop, the indexed input file goes through the Spatial File Splitter, which applies a filter function; each split is read by a Spatial Record Reader that passes (k,v) pairs to a Map task.]

Operations
Operations Layer
- Basic operations: e.g., range query and kNN
- Spatial join operations
- Computational geometry operations: e.g., Polygon Union, Voronoi diagram, Delaunay Triangulation, and Convex Hull
- User-defined operations: e.g., kNN join

Range Query
- Use the global index to prune disjoint partitions
- Use local indexes to find matching records

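
The two steps above can be sketched as follows. This is a minimal Python illustration, not SpatialHadoop code: the global index is assumed to be a plain list of (partition MBR, points) pairs, and a linear scan stands in for the local index. All names and data are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    """True if point p lies inside rectangle rect (borders included)."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def range_query(global_index, query):
    """Return (matching points, number of partitions actually scanned)."""
    result, scanned = [], 0
    for mbr, points in global_index:
        if not overlaps(mbr, query):
            continue                      # step 1: prune disjoint partitions
        scanned += 1
        # Step 2: a local index would answer this; a scan is used for brevity.
        result.extend(p for p in points if contains(query, p))
    return result, scanned

# Hypothetical global index: four quadrant partitions with two points each.
global_index = [
    ((0, 0, 5, 5), [(1, 1), (4, 4)]),
    ((5, 0, 10, 5), [(6, 1), (9, 4)]),
    ((0, 5, 5, 10), [(1, 6), (4, 9)]),
    ((5, 5, 10, 10), [(6, 6), (9, 9)]),
]
matches, scanned = range_query(global_index, (0.5, 0.5, 4.5, 4.5))
```

In this example the query rectangle overlaps only the first partition, so three of the four partitions are never read.
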
KNN over Indexed Data (example with k=3)
- First iteration runs as before, and the result is tested for correctness (in the example, the answer is incorrect)
- Second iteration processes other blocks that might contain an answer (✓ the answer is correct)

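
The two iterations can be sketched in Python. The correctness test used here is an assumption made for illustration: the first answer is accepted only if the circle centered at the query point, with radius equal to the distance to the k-th neighbor, does not reach any other partition. The names and data are hypothetical.

```python
import heapq
import math

def min_dist(mbr, q):
    """Smallest distance from point q to rectangle mbr (0 if q is inside)."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def knn(partitions, q, k):
    """partitions: list of (mbr, points). Returns (neighbors, iterations used)."""
    def dist(p):
        return math.dist(p, q)
    # Iteration 1: search only the partition nearest to (normally containing) q.
    home = min(range(len(partitions)), key=lambda i: min_dist(partitions[i][0], q))
    best = heapq.nsmallest(k, partitions[home][1], key=dist)
    radius = dist(best[-1]) if len(best) == k else math.inf
    # Correctness test: does the circle (q, radius) reach any other partition?
    others = [i for i in range(len(partitions))
              if i != home and min_dist(partitions[i][0], q) < radius]
    if not others:
        return best, 1
    # Iteration 2: also process the partitions that might contain an answer.
    candidates = best + [p for i in others for p in partitions[i][1]]
    return heapq.nsmallest(k, candidates, key=dist), 2

partitions = [
    ((0, 0, 5, 5), [(1, 1), (2, 2), (4, 4)]),
    ((5, 0, 10, 5), [(6, 4), (9, 1)]),
    ((0, 5, 5, 10), [(1, 9)]),
    ((5, 5, 10, 10), [(6, 6), (9, 9)]),
]
inner_answer = knn(partitions, (2, 2), 2)
border_answer = knn(partitions, (4.5, 4.5), 2)
```

A query in the middle of a partition finishes in one iteration, while a query near a partition border needs the second iteration to pick up a closer point from a neighboring partition.
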
Spatial Join
- Two strategies: Partition – Join and Join Directly
- Figure annotations: total of 36 overlapping pairs; only 16 overlapping pairs

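
The effect of partitioning on the number of partition pairs a join must process can be sketched in Python. The setup is an assumption chosen for illustration: one file is indexed with a 4 x 4 grid and the other with a 3 x 3 grid over the same space, and repartitioning means rebuilding the second file on the first file's 4 x 4 grid. The slide does not state which grids its figure uses.

```python
from itertools import product

def grid(n, size=12.0):
    """n x n uniform grid of cell MBRs (xmin, ymin, xmax, ymax) over [0, size]^2."""
    step = size / n
    return [(i * step, j * step, (i + 1) * step, (j + 1) * step)
            for i in range(n) for j in range(n)]

def interiors_overlap(a, b):
    """True if the rectangles share positive area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def overlapping_pairs(index_a, index_b):
    """All pairs of partitions, one from each file, that must be joined."""
    return [(a, b) for a, b in product(index_a, index_b) if interiors_overlap(a, b)]

# Joining the two differently partitioned files as they are.
direct = overlapping_pairs(grid(4), grid(3))
# Joining after the second file is repartitioned to match the first file's grid.
repartitioned = overlapping_pairs(grid(4), grid(4))
```

Under these assumed grids, joining the files as they are produces 36 overlapping pairs, while joining after repartitioning produces 16, which are the same counts the slide quotes. Repartitioning has its own cost, so fewer pairs does not by itself mean a faster join.
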
CG_Hadoop
- Operations: Polygon Union, Skyline, Convex Hull, Farthest/closest pair, Delaunay Triangulation, Voronoi Diagram
- Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x
A. Eldawy, Y. Li, M. F. Mokbel, R. Janardan. "CG_Hadoop: Computational Geometry in MapReduce", ACM SIGSPATIAL'13

Convex Hull
- Find the minimal convex polygon that contains all points
[Figure: input points and the output hull]

Convex Hull in CG_Hadoop
- Hadoop: partition, local hull, global hull
- SpatialHadoop: pruning, local hull, global hull

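
The local hull and global hull steps can be sketched in Python. The hull routine is Andrew's monotone chain, a standard algorithm picked for this illustration; the slide does not say which hull algorithm CG_Hadoop runs. The pruning step of the SpatialHadoop variant is not modeled, and the partition contents are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def distributed_hull(partitions):
    """Local hull per partition (map side), then one global hull (reduce side)."""
    local_hulls = [convex_hull(part) for part in partitions]
    merged = [p for hull in local_hulls for p in hull]
    return convex_hull(merged)

left = [(0, 0), (0, 4), (1, 3), (2, 2), (2, 1)]
right = [(4, 0), (4, 4), (3, 1)]
global_hull = distributed_hull([left, right])
```

Because the hull of a union of point sets equals the hull of the union of their hulls, only the local hull vertices need to reach the final step, which keeps the data passed to the global hull small.
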
Visualization
Visualization in HadoopViz
- The goal of HadoopViz is not to propose new visualization techniques; instead, its goal is to scale out existing techniques.
- Example images: Satellite Data, Road Network, Heat Map, Vector Map, Admin Boundaries, Scatter Plot

Heat Map
- From 2009 to 2014, month by month
- 72 frames × 14 billion points per frame
- Total = 1 trillion points
- Created in 3 hours on 10 nodes instead of 60 hours

Abstract Visualization
[Figure: the input is partitioned; each partition runs 1. smooth, 2. create canvas, and 3. plot; the partial canvases are combined by 4. merge; 5. write produces the output image.]

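
The five steps can be sketched in Python for a simple count-based heat map. This is a minimal illustration under stated assumptions, not the HadoopViz API: the function signatures are hypothetical, the canvas is a small grid of counters, smooth is a pass-through placeholder, and write returns text rows instead of an image file.

```python
def smooth(records):
    """Step 1: placeholder that passes records through unchanged."""
    return list(records)

def create_canvas(width, height):
    """Step 2: an empty canvas of zero counts."""
    return [[0] * width for _ in range(height)]

def plot(canvas, record, mbr):
    """Step 3: add one point to the pixel that covers it."""
    xmin, ymin, xmax, ymax = mbr
    width, height = len(canvas[0]), len(canvas)
    col = min(int((record[0] - xmin) / (xmax - xmin) * width), width - 1)
    row = min(int((record[1] - ymin) / (ymax - ymin) * height), height - 1)
    canvas[row][col] += 1

def merge(target, partial):
    """Step 4: add a partial canvas into the target, pixel by pixel."""
    for r, row in enumerate(partial):
        for c, value in enumerate(row):
            target[r][c] += value

def write(canvas):
    """Step 5: render the canvas; text rows stand in for an image."""
    return [" ".join(str(v) for v in row) for row in canvas]

def visualize(partitions, mbr, width, height):
    final = create_canvas(width, height)
    for part in partitions:                 # each partition is independent
        canvas = create_canvas(width, height)
        for record in smooth(part):
            plot(canvas, record, mbr)
        merge(final, canvas)
    return write(final)

image = visualize([[(0.5, 0.5), (1.5, 0.5)], [(0.5, 0.5), (1.5, 1.5)]],
                  (0.0, 0.0, 2.0, 2.0), 2, 2)
```

Since each partition only touches its own canvas until the merge step, the per-partition work can run in parallel, which is what lets an existing plotting technique scale out.
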