Big Spatial Data Management on Hadoop and Beyond
Ahmed Eldawy
Computer Science and Engineering
Claudius Ptolemy (AD 90 – AD 168)
Al Idrisi (1099 – 1165)
Cholera cases in the London epidemic of 1854
"I have BIG data. I need HELP..!!"
"Cool computer technology..!! Can I use it in my application?"
"My pleasure. Here it is."
"Oh..!! But, it is not made for me. Can’t make use of it as is."
1969
"Kindly let me get the technology you have."
"Kindly let me understand your needs."
"HELP..!! I have BIG data. Your technology is not helping me."
"mmm… Let me check with my good friends there."
"Cool Database technology..!! Can I use it in my application?"
"My pleasure. Here it is."
"Oh..!! But, it is not made for me. Can’t make use of it as is."
"Kindly let me understand your needs."
"Kindly let me get the technology you have."
"HELP..!! Again, I have BIG data. Your technology is not helping me."
"Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there."
"Cool Big Data technology..!! Can I use it in my application?"
"My pleasure. Here it is."
"Oh..!! But, it’s not made for me. Can’t make use of it as is."
"Kindly let me get the technology you have."
"Kindly let me understand your needs."
Big Spatial Data Management
Tons of Spatial data out there…
Geotagged Pictures, Geotagged Microblogs, Medical Data, Sensor Networks, Smart Phones, Satellite Images, Traffic Data, VGI
Spatial Data & Hadoop ➔ SpatialHadoop
Hadoop (takes 193 seconds):
    points = LOAD 'points' AS (id:int, x:int, y:int);
    result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;
SpatialHadoop (finishes in 2 seconds):
    points = LOAD 'points' AS (id:int, location:point);
    result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));
Spatial Operations, Visualization, Spatial Language, Spatial Indexes
80,000 downloads in one year
Conducted more than seven keynotes, tutorials, and invited talks
>500GB public datasets for benchmarking and testing
Incubated by Eclipse Foundation and renamed to GeoJinni
Figure labels: Industry, Academia, Students, Projects, Collaboration, University of Genova
The Built-in Approach of SpatialHadoop
[Figure: three system stacks side by side: The On-top Approach, the From Scratch Approach, and The Built-in Approach (SpatialHadoop). The Hadoop layers shown are user programs, language APIs (Pig Latin, Hadoop Java APIs), job monitoring and scheduling, MapReduce runtime, and storage (HDFS). The on-top stack adds spatial modules to the user program. The built-in stack adds a spatial language, spatial operators, early pruning, and spatial indexing to these layers.]
Agenda
The ecosystem of SpatialHadoop
Motivation
Internal system design
Applications
Related work
Experiments
Open Research Problems
SpatialHadoop Architecture
Applications: SHAHED [ICDE’15] – MNTG [SSTD’13, ICDE’14] – TAREEG [SIGMOD’14, SIGSPATIAL’14]
Language: Pigeon [ICDE’14]
Visualization: [VLDB’15, ICDE’16]
Operations: Basic operations – CG_Hadoop [SIGSPATIAL’13]
MapReduce: Spatial File Splitter – Spatial Record Reader
Indexing: Grid – R-tree – R+-tree – Quad tree [VLDB’15]
Other labels in the figure: ST-Hadoop, VLDB’13, ICDE’15
Indexing
Data Loading in Hadoop
Hadoop blindly chops a big file into 128MB chunks
Values of records are not considered
Relevant records are typically assigned to two different blocks
HDFS is too restrictive: files cannot be modified
[Figure: an input file split into 128MB blocks spread across the data nodes]
Two-layer Index Layout
Figure labels: Global Indexing, Global Index, Data Nodes, Locally Indexed HDFS Blocks
Uniform Grid
Works only for uniformly distributed data
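The limitation above can be seen in a minimal sketch of grid partitioning. The function names, the 2×2 grid, and the sample points are all illustrative assumptions, not SpatialHadoop's code.

```python
from collections import Counter

def grid_cell(x, y, mbr, cols, rows):
    """Return the (col, row) of the uniform grid cell that contains (x, y)."""
    xmin, ymin, xmax, ymax = mbr
    # Clamp so points on the upper boundary fall into the last cell.
    col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
    row = min(int((y - ymin) / (ymax - ymin) * rows), rows - 1)
    return col, row

def partition_sizes(points, mbr, cols, rows):
    """Count how many points land in each grid cell (one cell = one partition)."""
    return Counter(grid_cell(x, y, mbr, cols, rows) for x, y in points)

mbr = (0.0, 0.0, 4.0, 4.0)                       # hypothetical input MBR
uniform = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
skewed = [(0.1 * i, 0.1 * i) for i in range(16)]  # clustered near the origin

uniform_sizes = partition_sizes(uniform, mbr, 2, 2)  # four balanced partitions
skewed_sizes = partition_sizes(skewed, mbr, 2, 2)    # one overloaded partition
```

With the uniform input every cell receives the same number of points; with the skewed input a single cell receives all of them, which is the load-balance problem the slide points at.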
R-tree
Read a sample
Bulk load the sample into an R-tree
Leaf node capacity: $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$
    k: Sample size
    B: HDFS block capacity
    |R|: Input size
    α: Index overhead
Use MBR of leaf nodes as partition boundaries
Partition the data
Optional: Build R-tree local indexes
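As a rough illustration of the steps above, the sketch below computes the leaf capacity from the slide's symbols (k, B, |R|, α) and packs a sample into leaves whose MBRs would serve as partition boundaries. The slide only says "bulk load"; the Sort-Tile-Recursive style packing, the rounding down of C, and all function names and numbers are assumptions made for this example.

```python
import math

def leaf_capacity(k, block_size, input_size, alpha):
    """C = k * B / (|R| * (1 + alpha)); B and |R| must use the same unit.
    Rounding down (with a floor of 1) is an assumption of this sketch."""
    return max(1, math.floor(k * block_size / (input_size * (1 + alpha))))

def leaf_mbrs(sample, capacity):
    """Pack sample points into leaves of at most `capacity` points
    (STR-style: vertical strips by x, then runs by y) and return leaf MBRs."""
    n_leaves = math.ceil(len(sample) / capacity)
    n_strips = math.ceil(math.sqrt(n_leaves))
    per_strip = n_strips * capacity
    by_x = sorted(sample)
    mbrs = []
    for i in range(0, len(by_x), per_strip):
        strip = sorted(by_x[i:i + per_strip], key=lambda p: p[1])
        for j in range(0, len(strip), capacity):
            leaf = strip[j:j + capacity]
            xs = [p[0] for p in leaf]
            ys = [p[1] for p in leaf]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs

# Hypothetical numbers: 1000 sample points, 128 MB blocks, 10 GB input, 20% overhead.
capacity = leaf_capacity(k=1000, block_size=128, input_size=10 * 1024, alpha=0.2)

# A small 10x10 sample packed into leaves of 25 points gives four partitions.
sample = [(i, j) for i in range(10) for j in range(10)]
boundaries = leaf_mbrs(sample, 25)
```

Each input record would then be assigned to the partition whose boundary contains it; that step and the optional local index are left out of the sketch.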
R-tree-based Index of a 400 GB road network
Non-indexed Heap File
MapReduce Layer
[Figure: SpatialHadoop architecture. Additional labels: ST-Hadoop [TODS]; CG_Hadoop [SIGSPATIAL’13, TSAS]. Legend: under review; demo paper.]
Map plan – Hadoop: input heap file → File Splitter → splits → Record Reader → (k,v) pairs → Map task
Map plan – SpatialHadoop: indexed input file(s) → Spatial File Splitter (with filter function) → splits → Spatial Record Reader → (k,v) pairs → Map task
Figure label: Number of splits
Operations
Operations Layer
Basic operations: e.g., range query and kNN
Spatial join
Computational geometry operations: e.g., Polygon Union, Voronoi Diagram, Delaunay Triangulation, and Convex Hull
User-defined operations: e.g., kNN join
Range Query
Use the global index to prune disjoint partitions
Use local indexes to find matching records
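The two steps above can be sketched as follows. The global index is modeled as a plain list of (partition MBR, records) pairs, and the local index is replaced by a linear scan to keep the example short; both are simplifying assumptions, and the names are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles a and b (xmin, ymin, xmax, ymax) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    """True if point p lies inside rect."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def range_query(global_index, query):
    """Return matching points and the number of partitions actually read."""
    results, partitions_read = [], 0
    for mbr, points in global_index:
        if not overlaps(mbr, query):
            continue  # global-index pruning: the partition is never opened
        partitions_read += 1
        # A local index would go here; this sketch scans the partition instead.
        results.extend(p for p in points if contains(query, p))
    return results, partitions_read

# Hypothetical 2x2 partitioning of the space [0,4] x [0,4].
global_index = [
    ((0, 0, 2, 2), [(1, 1), (0.2, 1.8)]),
    ((2, 0, 4, 2), [(3, 1)]),
    ((0, 2, 2, 4), [(1, 3)]),
    ((2, 2, 4, 4), [(3, 3)]),
]
matches, partitions_read = range_query(global_index, (0.5, 0.5, 1.5, 1.5))
```

Here only one of the four partitions overlaps the query rectangle, so the other three are skipped entirely.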
KNN over Indexed Data (example with k=3)
First iteration runs as before and the result is tested for correctness
If the answer is incorrect, a second iteration processes other blocks that might contain an answer
✓ Answer is correct
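A minimal sketch of the two-iteration idea follows. The slide does not spell out the correctness test; this example assumes the common one, where the answer is final if a circle around the query point with radius equal to the distance of the k-th neighbor touches no other partition. Names and data are hypothetical.

```python
import heapq
import math

def min_dist(mbr, q):
    """Smallest distance from point q to rectangle mbr (0 if q is inside)."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def knn(partitions, q, k):
    """Return (k nearest points, number of iterations used)."""
    def dist(p):
        return math.dist(p, q)

    # Iteration 1: answer the query inside the partition closest to q.
    home = min(range(len(partitions)), key=lambda i: min_dist(partitions[i][0], q))
    best = heapq.nsmallest(k, partitions[home][1], key=dist)
    radius = dist(best[-1]) if len(best) == k else math.inf

    # Correctness test (assumed): does the circle reach any other partition?
    others = [i for i in range(len(partitions))
              if i != home and min_dist(partitions[i][0], q) < radius]
    if not others:
        return best, 1

    # Iteration 2: also read the partitions that might hold a closer point.
    candidates = list(best)
    for i in others:
        candidates.extend(partitions[i][1])
    return heapq.nsmallest(k, candidates, key=dist), 2

partitions = [
    ((0, 0, 2, 2), [(1, 1), (0.2, 1.8), (1.5, 0.5)]),
    ((2, 0, 4, 2), [(2.1, 1.0)]),
    ((0, 2, 2, 4), [(1, 3)]),
    ((2, 2, 4, 4), [(3, 3)]),
]
near_border, iterations_border = knn(partitions, (1.9, 1.0), 3)
centered, iterations_centered = knn(partitions, (1.0, 1.0), 1)
```

The query near the partition border needs the second iteration because a neighbor sits just across the boundary, while the query in the middle of a partition finishes in one.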
Spatial Join
Two strategies: Partition – Join, and Join Directly
[Figure: two partition overlays, one with a total of 36 overlapping pairs and the other with only 16 overlapping pairs]
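To make the pair counts concrete, the sketch below counts overlapping partition pairs between two partitionings. The grids are hypothetical: the actual partitions in the figure cannot be recovered from the text, and these were chosen only because they happen to produce the same counts. The example assumes that repartitioning one file makes its boundaries match the other's.

```python
def grid(n, size=12.0):
    """Partition the square [0, size] x [0, size] into an n x n grid of rectangles."""
    step = size / n
    return [(i * step, j * step, (i + 1) * step, (j + 1) * step)
            for i in range(n) for j in range(n)]

def overlapping_pairs(parts_a, parts_b):
    """All pairs of partitions whose interiors intersect; each pair is one join task."""
    return [(a, b) for a in parts_a for b in parts_b
            if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]]

# Two files indexed independently: a 4x4 grid joined with a 3x3 grid.
misaligned = overlapping_pairs(grid(4), grid(3))

# One file repartitioned to share the other's 4x4 boundaries.
aligned = overlapping_pairs(grid(4), grid(4))
```

Fewer overlapping pairs means fewer partition pairs to read and join, at the cost of repartitioning one input first.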
CG_Hadoop
Operations: Polygon Union, Skyline, Convex Hull, Farthest/closest pair, Delaunay Triangulation, Voronoi Diagram
[Figure: relative performance. Single Machine: 1x; Hadoop: 29x; SpatialHadoop: 260x]
A. Eldawy, Y. Li, M. F. Mokbel, R. Janardan. “CG_Hadoop: Computational Geometry in MapReduce”, ACM SIGSPATIAL’13
Convex Hull
Find the minimal convex polygon that contains all points
[Figure: input points and the output hull]
Convex Hull in CG_Hadoop
Hadoop: Partition → Local hull → Global hull
SpatialHadoop: Pruning → Local hull → Global hull
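The local hull and global hull steps can be sketched as below. Andrew's monotone chain is used as the single-machine hull algorithm purely as an assumption; the slides do not name one. The pruning step of the SpatialHadoop variant is not sketched, and the partitions are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # the last point starts the other half

    return half(pts) + half(reversed(pts))

def distributed_hull(partitions):
    """Local hull per partition (map side), then one hull of the hulls (reduce side)."""
    local_hulls = [convex_hull(part) for part in partitions]
    return convex_hull([p for hull in local_hulls for p in hull])

partitions = [
    [(0, 0), (1, 1), (0, 4), (1, 2)],
    [(4, 0), (3, 1), (4, 4), (2, 2)],
]
all_points = [p for part in partitions for p in part]
hull = distributed_hull(partitions)
```

The approach works because a point that is not on its partition's local hull cannot be on the global hull, so only local hull vertices need to reach the final step.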
Visualization
Visualization in HadoopViz
The goal of HadoopViz is not to propose new visualization techniques; instead, its goal is to scale out existing techniques.
Examples: Satellite Data (Heat Map), Road Network (Vector Map), Admin Boundaries, Scatter Plot
Heat Map
From 2009 to 2014, month by month
72 frames × 14 billion points per frame
Total = 1 trillion points
Created in 3 hours on 10 nodes instead of 60 hours
Abstract Visualization
[Figure: the input is partitioned; each partition goes through 1. smooth, 2. create canvas, and 3. plot; partial canvases are combined by 4. merge; 5. write produces the output image]
Example: Satellite Data Visualization
1. Smooth: Recover holes
2. Create Canvas: Initialize a 2D matrix with zeros
3. Plot: Update the matrix
4. Merge: Matrix addition
5. Write: Generate the image
[Figure: an all-zero matrix after step 2; partial matrices holding values such as 33, 8, and 26 are added together in step 4]
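Steps 2 to 5 above can be sketched in a few lines. This is not HadoopViz's API: the function names, the 6×6 canvas, the pixel positions, and the choice of a plain-text PGM image as output are assumptions for illustration. The smooth step is omitted. The values 33, 8, and 26 are borrowed from the slide's figure.

```python
def create_canvas(width, height):
    """Step 2: a 2D matrix initialized with zeros."""
    return [[0] * width for _ in range(height)]

def plot(canvas, records):
    """Step 3: update the matrix with each (x, y, value) record of one partition."""
    for x, y, value in records:
        canvas[y][x] += value
    return canvas

def merge(a, b):
    """Step 4: combine two partial canvases by matrix addition."""
    return [[va + vb for va, vb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def write(canvas):
    """Step 5: generate the image, here as plain-text PGM (P2) content."""
    peak = max(max(row) for row in canvas) or 1
    lines = ["P2", f"{len(canvas[0])} {len(canvas)}", str(peak)]
    lines += [" ".join(str(v) for v in row) for row in canvas]
    return "\n".join(lines)

# Two hypothetical partitions, each plotted on its own canvas.
partition_a = [(1, 1, 33)]
partition_b = [(4, 1, 8), (2, 2, 26)]
canvas_a = plot(create_canvas(6, 6), partition_a)
canvas_b = plot(create_canvas(6, 6), partition_b)

merged = merge(canvas_a, canvas_b)
image = write(merged)
```

Because merging is plain addition, partial canvases can be combined in any order, which is what lets the plotting run independently on every partition.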