The Era of Big Spatial Data Ahmed Eldawy Computer Science and Engineering
Claudius Ptolemy (AD 90 – AD 168)
Al Idrisi (1099 – 1165)
Cholera cases in the London epidemic of 1854
I have BIG data. I need HELP..!! Cool computer technology..!! Can I use it in my application? My pleasure. Here it is. Oh..!! But, it is not made for me. Can’t make use of it as is.
1969: Kindly let me understand your needs. Kindly let me get the technology you have.
HELP..!! I have BIG data. Your technology is not helping me. mmm…Let me check with my good friends there. Cool Database technology..!! Can I use it in my application? My pleasure. Here it is. Oh..!! But, it is not made for me. Can’t make use of it as is.
Kindly let me understand your needs. Kindly let me get the technology you have.
HELP..!! Again, I have BIG data. Your technology is not helping me. Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there. Cool Big Data technology..!! Can I use it in my application? My pleasure. Here it is. Oh..!! But, it’s not made for me. Can’t make use of it as is.
Kindly let me understand your needs. Kindly let me get the technology you have.
Big Spatial Data
Tons of spatial data out there: geotagged pictures, geotagged microblogs, medical data, sensor networks, smart phones, satellite images, traffic data, VGI (volunteered geographic information)
SpatialHadoop: Spatial Data & Hadoop
Hadoop:
    points = LOAD 'points' AS (id:int, x:int, y:int);
    result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;
    Takes 193 seconds
SpatialHadoop:
    points = LOAD 'points' AS (id:int, location:point);
    result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));
    Finishes in 2 seconds
Spatial Language, Spatial Indexes, Spatial Operations, Visualization
80,000 downloads in one year
Conducted more than seven keynotes, tutorials, and invited talks
Collaboration: industry, academia, student projects
>500GB public datasets for benchmarking and testing
University of Genova
The Built-in Approach of SpatialHadoop
Figure: three layered stacks comparing approaches to spatial support on Hadoop.
The On-top Approach: User Programs + Spatial Modules, running over unmodified Hadoop layers (Pig Latin, Hadoop Java APIs, Job Monitoring and Scheduling, MapReduce Runtime, Storage (HDFS))
The From Scratch Approach: (Spatial) …
The Built-in Approach (SpatialHadoop): User Programs, Spatial Language, Pig Latin and Hadoop Java APIs + Spatial Operators, Job Monitoring and Scheduling, MapReduce Runtime + Early Pruning, Storage (HDFS) + Spatial Indexing
Agenda
Motivation
The ecosystem of SpatialHadoop
    Internal system design
    Applications
    Related work
    Performance results
Open research problems
SpatialHadoop Architecture
Applications: SHAHED [ICDE’15] – MNTG [SSTD’13, ICDE’14] – TAREEG [SIGMOD’14, SIGSPATIAL’14]
Language: Pigeon [ICDE’14]
Visualization [VLDB’15, ICDE’16]
Operations: Basic operations – CG_Hadoop [SIGSPATIAL’13]
MapReduce: Spatial File Splitter, Spatial Record Reader
Indexing: Grid – R-tree – R+-tree – Quad tree [VLDB’15]
Also shown: ST-Hadoop, VLDB’13, ICDE’15
Indexing
Data Loading in Hadoop
Blindly chops a big input file into 128MB chunks, one per block on the data nodes
Values of records are not considered
Related records are typically assigned to two different blocks
HDFS is too restrictive: files cannot be modified
Spatial Distributed File System Default Partitioning Spatial Partitioning
Uniform Grid: works only for uniformly distributed data
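A minimal sketch of why a uniform grid struggles with skew. The function names, the 2x2 grid, and the synthetic point sets are assumptions made for illustration and are not SpatialHadoop code. It assigns points to grid cells and compares partition sizes for uniform and skewed inputs.

```python
from collections import Counter


def grid_cell(x, y, mbr, cols, rows):
    """Return the (col, row) of the uniform-grid cell that contains (x, y)."""
    xmin, ymin, xmax, ymax = mbr
    # Clamp so that points on the upper boundary fall into the last cell.
    col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
    row = min(int((y - ymin) / (ymax - ymin) * rows), rows - 1)
    return col, row


def partition_sizes(points, mbr, cols, rows):
    """Count how many points land in each grid cell."""
    return Counter(grid_cell(x, y, mbr, cols, rows) for x, y in points)


mbr = (0.0, 0.0, 1.0, 1.0)
# 100 evenly spread points (hypothetical data).
uniform = [((i + 0.5) / 10, (j + 0.5) / 10) for i in range(10) for j in range(10)]
# Squaring the coordinates pushes the points toward the origin (skewed data).
skewed = [(x * x, y * y) for x, y in uniform]

uniform_sizes = partition_sizes(uniform, mbr, 2, 2)
skewed_sizes = partition_sizes(skewed, mbr, 2, 2)
print(sorted(uniform_sizes.values()))  # balanced partitions
print(sorted(skewed_sizes.values()))   # unbalanced partitions
```

With the uniform input every cell receives the same number of points, while the skewed input overloads the cell near the origin, which is the load imbalance that motivates sample-based partitioning.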
R-tree
Read a sample
Bulk load the sample into an R-tree
Leaf node capacity: $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$
    k: sample size; B: HDFS block capacity; |R|: input size; α: index overhead
Use MBR of leaf nodes as partition boundaries
Partition the data
Optional: build R-tree local indexes
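The steps above can be sketched as follows. This is an illustration under stated assumptions, not the SpatialHadoop implementation: STR (Sort-Tile-Recursive) packing stands in for the bulk-loading step, records are assigned to the nearest leaf MBR, and the sizes (a 1000-point sample, 128 MB blocks, a 10240 MB input, 20% index overhead) are hypothetical.

```python
import math
import random


def leaf_capacity(k, block_capacity, input_size, alpha):
    """C = k * B / (|R| * (1 + alpha)), rounded down, at least 1."""
    return max(1, math.floor(k * block_capacity / (input_size * (1 + alpha))))


def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def str_leaves(sample, capacity):
    """Pack sample points into leaves with STR bulk loading (assumed technique)."""
    n_leaves = math.ceil(len(sample) / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))
    per_slice = n_slices * capacity  # points per vertical slice
    by_x = sorted(sample)
    leaves = []
    for i in range(0, len(by_x), per_slice):
        strip = sorted(by_x[i:i + per_slice], key=lambda p: p[1])
        for j in range(0, len(strip), capacity):
            leaves.append(strip[j:j + capacity])
    return leaves


def dist2_to_box(p, box):
    """Squared distance from point p to rectangle box (0 if p is inside)."""
    dx = max(box[0] - p[0], 0.0, p[0] - box[2])
    dy = max(box[1] - p[1], 0.0, p[1] - box[3])
    return dx * dx + dy * dy


def assign_partition(p, boundaries):
    """Index of the partition whose MBR is closest to p (simplified rule)."""
    return min(range(len(boundaries)), key=lambda i: dist2_to_box(p, boundaries[i]))


rng = random.Random(42)
sample = [(rng.random(), rng.random()) for _ in range(1000)]
capacity = leaf_capacity(k=len(sample), block_capacity=128, input_size=10240, alpha=0.2)
leaves = str_leaves(sample, capacity)
boundaries = [mbr(leaf) for leaf in leaves]  # partition boundaries
print(capacity, len(boundaries))
```

Because the boundaries come from a sample, dense regions get more and smaller partitions, so each partition holds roughly one block of data regardless of skew.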
R-tree-based Index of a 400 GB road network
Non-indexed Heap File
Operations
Operations Layer
Basic operations: e.g., range query and kNN
Spatial join operations
Computational geometry operations: e.g., polygon union, Voronoi diagram, Delaunay triangulation, and convex hull
User-defined operations: e.g., kNN join
Range Query
Use the global index to prune disjoint partitions
Use local indexes to find matching records
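A minimal sketch of the two-level filtering described above. The partition layout and function names are assumptions for illustration, and a linear scan stands in for the local index, so this shows the pruning logic rather than SpatialHadoop's actual code.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(rect, p):
    """True if point p lies inside rectangle rect."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]


def range_query(partitions, query):
    """partitions: list of (mbr, records). Returns (matches, partitions_read)."""
    matches, partitions_read = [], 0
    for mbr, records in partitions:
        if not overlaps(mbr, query):
            continue  # global index: skip partitions disjoint from the query
        partitions_read += 1
        # A linear scan stands in for the local index lookup here.
        matches.extend(p for p in records if contains(query, p))
    return matches, partitions_read


# Hypothetical 2x2 partitioning of the square [0, 10] x [0, 10].
partitions = [
    ((0, 0, 5, 5), [(1, 1), (4, 4)]),
    ((5, 0, 10, 5), [(6, 1), (9, 4)]),
    ((0, 5, 5, 10), [(1, 6), (4, 9)]),
    ((5, 5, 10, 10), [(6, 6), (9, 9)]),
]
matches, partitions_read = range_query(partitions, (0.5, 0.5, 4.5, 4.5))
print(matches, partitions_read)
```

Only one of the four partitions is read for this query; without the global index every partition would be scanned.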
KNN over Indexed Data (k=3)
First iteration runs as before and result is tested for correctness: answer is incorrect
Second iteration processes other blocks that might contain an answer: answer is correct
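The two iterations can be sketched as below. This assumes the correctness test checks whether the circle through the k-th candidate reaches any other partition; the slide does not spell the test out, so treat that and all names and data here as assumptions. Partitions are assumed to tile the space so that the query point falls in exactly one of them.

```python
import heapq
import math


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def contains(mbr, p):
    return mbr[0] <= p[0] <= mbr[2] and mbr[1] <= p[1] <= mbr[3]


def min_dist(mbr, q):
    """Smallest distance from point q to rectangle mbr (0 if q is inside)."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)


def knn(partitions, q, k):
    """Return (k nearest points, number of iterations used)."""
    # Iteration 1: answer the query inside the partition that contains q.
    home = next(i for i, (mbr, _) in enumerate(partitions) if contains(mbr, q))
    candidates = heapq.nsmallest(k, partitions[home][1], key=lambda p: dist(p, q))
    radius = dist(candidates[-1], q) if len(candidates) == k else math.inf
    # Correctness test: does the circle (q, radius) reach another partition?
    others = [i for i, (mbr, _) in enumerate(partitions)
              if i != home and min_dist(mbr, q) < radius]
    if not others:
        return candidates, 1
    # Iteration 2: also process the partitions that might hold a closer point.
    for i in others:
        candidates.extend(p for p in partitions[i][1] if dist(p, q) < radius)
    return heapq.nsmallest(k, candidates, key=lambda p: dist(p, q)), 2


# Hypothetical 2x2 partitioning of the square [0, 10] x [0, 10].
partitions = [
    ((0, 0, 5, 5), [(1, 1), (2, 2), (4, 4), (4.5, 1)]),
    ((5, 0, 10, 5), [(5.5, 4.5), (9, 1)]),
    ((0, 5, 5, 10), [(1, 9), (4, 6)]),
    ((5, 5, 10, 10), [(6, 6), (9, 9)]),
]
near_corner, iters_corner = knn(partitions, (4.2, 4.2), 3)
interior, iters_interior = knn(partitions, (1.2, 1.2), 2)
print(near_corner, iters_corner)
print(interior, iters_interior)
```

A query near a partition corner needs the second iteration, while a query deep inside a partition is answered in one.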
Spatial Join
Join Directly: total of 36 overlapping pairs
Partition – Join: only 16 overlapping pairs
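A small sketch of how the number of overlapping partition pairs can differ between the two strategies. The grids in the slide's figure are not recoverable from the text, so the 4x4 and 3x3 grids below are an assumption chosen because they happen to reproduce the same counts.

```python
def grid(n):
    """Cells of an n x n uniform grid over the unit square, as (xmin, ymin, xmax, ymax)."""
    return [(i / n, j / n, (i + 1) / n, (j + 1) / n)
            for i in range(n) for j in range(n)]


def interiors_overlap(a, b):
    """True if rectangles a and b share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(parts_a, parts_b):
    """All pairs of partitions that must be joined with each other."""
    return [(a, b) for a in parts_a for b in parts_b if interiors_overlap(a, b)]


# Join directly: one file indexed by a 4x4 grid, the other by a 3x3 grid.
direct = overlapping_pairs(grid(4), grid(3))
# Partition-join: repartition the second file to match the first file's grid.
repartitioned = overlapping_pairs(grid(4), grid(4))
print(len(direct), len(repartitioned))
```

Repartitioning costs an extra pass over one input, so it pays off only when the saving in joined pairs outweighs that cost.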
CG_Hadoop
Operations: Polygon Union, Skyline, Convex Hull, Farthest/closest pair, Delaunay Triangulation, Voronoi Diagram
Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x
A. Eldawy, Y. Li, M. F. Mokbel, R. Janardan. “CG_Hadoop: Computational Geometry in MapReduce”, ACM SIGSPATIAL’13
Convex Hull: find the minimal convex polygon that contains all points (figure: input and output)
Convex Hull in CG_Hadoop
Hadoop: Partition, Local hull, Global hull
SpatialHadoop: Partition, Pruning, Local hull, Global hull
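A single-process sketch of the partition, local hull, global hull pipeline. Andrew's monotone chain is used as the hull algorithm and the partitioning is a simple split by x order; both are assumptions made for illustration, and the pruning step that SpatialHadoop adds is left out.

```python
import math
import random


def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def partitioned_hull(points, n_parts):
    """Partition, compute a local hull per partition, then one global hull."""
    pts = sorted(points)
    size = math.ceil(len(pts) / n_parts)
    parts = [pts[i:i + size] for i in range(0, len(pts), size)]  # partition
    local_hulls = [convex_hull(part) for part in parts]          # map step
    survivors = [p for hull in local_hulls for p in hull]
    return convex_hull(survivors)                                # reduce step


small = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(partitioned_hull(small, 2))

rng = random.Random(7)
cloud = [(rng.random(), rng.random()) for _ in range(2000)]
same = partitioned_hull(cloud, 8) == convex_hull(cloud)
print(same)
```

The global step only sees the vertices of the local hulls, which is usually far fewer points than the input, so the final single-machine step stays cheap.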
Advanced Analytics (Ongoing work)
Steps: Partitioning, Local VD, Pruning, Vertical Merge, Pruning, Horizontal Merge, Final output
Applications
SHAHED: a system for querying and visualizing spatio-temporal satellite data
http://shahed.cs.umn.edu/
Visualize animated heat maps or still images
Run spatio-temporal selection and aggregate queries
A. Eldawy et al. “SHAHED: A MapReduce-based System for Querying and Visualizing Spatio-temporal Satellite Data”, IEEE ICDE’15 (Best poster runner-up)
A. Eldawy et al. “A Demonstration of SHAHED: A MapReduce-based System for Querying and Visualizing Satellite Data”, IEEE ICDE’15
TAREEG: web-based extractor for OpenStreetMap data using MapReduce
http://tareeg.net/
L. Alarabi, A. Eldawy, R. Alghamdi, M. F. Mokbel. “TAREEG: A MapReduce-Based System for Extracting Spatial Data from OpenStreetMap”, ACM SIGSPATIAL’14
___ “TAREEG: A MapReduce-Based Web Service for Extracting Spatial Data from OpenStreetMap”, SIGMOD’14
Other Big Spatial Data Systems
Systems shown include ESRI Tools for Hadoop
SpatialHadoop is the only extensible system that can be easily expanded by researchers and developers
A. Eldawy and M. Mokbel. “The Era of Big Spatial Data: A Survey”, Foundations and Trends in Databases 2016
Performance Results
Figure: four charts.
Throughput of Range Query: SpatialHadoop vs. Hadoop (annotated 500X)
Spatial Join: running time (sec) with different indexes, including Hilbert, K-d, and Quad
Speedup of CG_Hadoop: Union, Voronoi, Skyline, Convex Hull, Closest Pair, Farthest Pair; Baseline Hadoop vs. SpatialHadoop (annotated 260X)
Visualization Speedup: Baseline vs. HadoopViz (annotated 48X)