
The Era of Big Spatial Data Ahmed Eldawy Computer Science and - PowerPoint PPT Presentation

The Era of Big Spatial Data. Ahmed Eldawy, Computer Science and Engineering, University of California - Riverside. Claudius Ptolemy (AD 90 – AD 168). Al Idrisi (1099–1165). Cholera cases in the London epidemic of 1854. Cool computer


  1. The Era of Big Spatial Data. Ahmed Eldawy, Computer Science and Engineering, University of California - Riverside

  2. Claudius Ptolemy (AD 90 – AD 168)

  3. Al Idrisi (1099–1165)

  4. Cholera cases in the London epidemic of 1854

  5. “I have BIG data. I need HELP..!!” “Cool computer technology..!! Can I use it in my application?” “My pleasure. Here it is.” “Oh..!! But, it is not made for me. Can’t make use of it as is.”

  6. 1969: “Kindly let me understand your needs.” “Kindly let me get the technology you have.”

  7. “HELP..!! I have BIG data. Your technology is not helping me.” “mmm…Let me check with my good friends there.” “Cool Database technology..!! Can I use it in my application?” “My pleasure. Here it is.” “Oh..!! But, it is not made for me. Can’t make use of it as is.”

  8. “Kindly let me understand your needs.” “Kindly let me get the technology you have.”

  9. “HELP..!! Again, I have BIG data. Your technology is not helping me.” “Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there.” “Cool Big Data technology..!! Can I use it in my application?” “My pleasure. Here it is.” “Oh..!! But, it’s not made for me. Can’t make use of it as is.”

  10. “Kindly let me get the technology you have.” “Kindly let me understand your needs.”

  11. The Era of Big Spatial Data

  12. The Rise of Big Spatial Data: traffic data, smart phones, satellite images, sensor networks, medical data, VGI, geotagged pictures, geotagged microblogs

  13. Big Spatial Data Systems

  14. The Era of Big Spatial Data: Recently, a few products have emerged …

  15. Approaches for Building a Big Spatial Data System: the on-top approach, the from-scratch approach, and the built-in approach. [Figure: the Hadoop stack (user programs; Pig Latin and Java APIs; job monitoring and scheduling; MapReduce runtime; HDFS storage) drawn per approach, with the spatial additions marked: spatial modules and spatial user programs, plus “+ Spatial” language, operators, access methods, and indexing at the corresponding layers.] …

  16. System Architecture for Big Spatial Data. Applications: satellite imagery, GIS, microblogs, medical imagery, …; Language; Visualization: single level and multilevel images; Query Processing: basic queries, spatial join, and computational geometry; Indexing: Grid, R-tree, Quad tree, K-d tree, …

  17. Indexing. Applications: satellite imagery, GIS, microblogs, medical imagery, …; Language; Visualization: single level and multilevel images; Query Processing: basic queries, spatial join, and computational geometry; Indexing: Grid, R-tree, Quad tree, K-d tree, …

  18. Data Loading in Hadoop. Hadoop Distributed File System (HDFS) is widely used. HDFS is unaware of spatial data. Challenges: big data size; HDFS files are sequential and write once. [Figure: an input file stored as 64MB blocks across data nodes.]

  19. Two-layer Index Layout. [Figure: global indexing partitions the data across data nodes; each node holds locally indexed HDFS blocks; a global index sits on top.]

  20. Spatial Indexing Classification 1. How to calculate number of partitions? 2. What is the type of global index? 3. What is the type of local indexes? 4. Is it a clustered or unclustered index? 5. Is it a static or dynamic index?

  21. Uniform Grid Index. Apply a uniform grid of size n × n. Scan the input and assign each record to overlapping partitions. # of Partitions: User-defined [1] or # of HDFS blocks [2]; Global: Grid; Local: None; Clustered; Static. [1] A. Aji, et al. “Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce”. In VLDB, 2013. [2] A. Eldawy and M. F. Mokbel. “SpatialHadoop: A MapReduce Framework for Spatial Data”. In ICDE, 2015.
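The grid assignment step can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the Hadoop-GIS or SpatialHadoop implementation; the names `grid_cells` and `partition`, the `(x1, y1, x2, y2)` rectangle layout, and the choice to treat a record touching a cell border as overlapping that cell are all assumptions made for the example.

```python
from collections import defaultdict

def grid_cells(mbr, space, n):
    """Yield (col, row) for every cell of an n x n grid over `space` that `mbr` overlaps."""
    x1, y1, x2, y2 = mbr
    sx1, sy1, sx2, sy2 = space
    cell_w = (sx2 - sx1) / n
    cell_h = (sy2 - sy1) / n
    # Clamp indexes so records on the outer border stay inside the grid.
    c1 = min(n - 1, max(0, int((x1 - sx1) / cell_w)))
    c2 = min(n - 1, max(0, int((x2 - sx1) / cell_w)))
    r1 = min(n - 1, max(0, int((y1 - sy1) / cell_h)))
    r2 = min(n - 1, max(0, int((y2 - sy1) / cell_h)))
    for c in range(c1, c2 + 1):
        for r in range(r1, r2 + 1):
            yield (c, r)

def partition(records, space, n):
    """Assign each (id, mbr) record to every grid cell it overlaps (records may be replicated)."""
    parts = defaultdict(list)
    for rid, mbr in records:
        for cell in grid_cells(mbr, space, n):
            parts[cell].append(rid)
    return dict(parts)

records = [("a", (0.5, 0.5, 1, 1)), ("b", (1.5, 1.5, 2.5, 2.5)), ("c", (3, 3, 4, 4))]
parts = partition(records, space=(0, 0, 4, 4), n=2)
```

Record `b` spans the grid center, so it is replicated to all four cells; this replication is what makes a later duplicate-avoidance step necessary for grid indexes.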

  22. R-tree construction. Sample; sort by Z-curve; divide into n ranges. Scan input records and partition to the n ranges. Construct an R-tree for each partition. # of Partitions: # of Machines; Global: Z-curve; Local: R-tree; Clustered; Static. A. Cary, Z. Sun, V. Hristidis, and N. Rishe. “Experiences on Processing Spatial Data with MapReduce”. In SSDBM, 2009.
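The partitioning half of this construction (sample, sort by Z-curve, divide into n ranges, route records) might look like the following Python sketch. It assumes non-negative integer coordinates and uses hypothetical helper names (`z_value`, `split_points`, `partition_of`); building the local R-tree for each partition is left out.

```python
import bisect

def z_value(x, y, bits=16):
    """Interleave the bits of integer x and y: x in even positions, y in odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def split_points(sample, n):
    """Sort the sample by Z-value and pick n - 1 boundaries that cut it into n ranges."""
    zs = sorted(z_value(x, y) for x, y in sample)
    return [zs[(i * len(zs)) // n] for i in range(1, n)]

def partition_of(point, splits):
    """Return the index (0 .. n-1) of the Z-range that the point falls in."""
    return bisect.bisect_right(splits, z_value(*point))

sample = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (3, 3), (2, 3), (3, 2)]
splits = split_points(sample, 2)
```

Because the boundaries come from a sample, the ranges are only approximately balanced on the full input; a larger sample gives a closer approximation.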

  23. R-tree and R+-tree. Number of partitions (blocks): n = (… size) / (block capacity). Find partition boundaries: Step 1: Sampling; Step 2: Bulk load in an R(+)-tree; Step 3: Partition boundaries are the MBRs of leaf nodes. Scan input file, assign each record to its partition(s). Build an R(+)-tree local index for each partition. # of Partitions: # of HDFS blocks; Global: R(+)-tree; Local: R(+)-tree; Clustered; Static. A. Eldawy and M. F. Mokbel. “SpatialHadoop: A MapReduce Framework for Spatial Data”. In ICDE, 2015.

  24. Quad tree. Split the input file over machines. Create a Quad tree for each split. Partition the leaf nodes across machines. Merge leaf nodes to construct the final tree. # of Partitions: User-defined; Global: Quad-tree; Local: Quad-tree; Clustered or Unclustered; Static. [Figure: input Split1 to Split4, machines M1 to M4, final tree.] R. T. Whitman, M. B. Park, S. A. Ambrose, and E. G. Hoel. “Spatial Indexing and Analytics on Hadoop”. In SIGSPATIAL, 2014.

  25. MD-HBase. Utilizes the linear index in HBase. Keeps points sorted by Z-curve order. Builds a virtual Quad-tree or K-d tree on top of the sorted order. # of Partitions: # of HDFS blocks; Global: K-d tree or Quad-tree; Local: --; Clustered; Dynamic (Insertion and Deletion). S. Nishimura, et al. “MD-HBase: Design and Implementation of an Elastic Data Infrastructure for Cloud-scale Location Services”. DAPD, 31(2), 2013.

  26. Quad-tree-based trajectory index. Initially, all trajectories are stored in one partition. As the partition fills up, new partitions are created for new records. Each partition is defined by a spatio-temporal bounding box (rectangle + time interval). # of Partitions: # of HDFS blocks; Global: Quad-tree; Local: --; Clustered; Dynamic (Insertion only). Q. Ma, B. Yang, W. Qian, and A. Zhou. “Query Processing of Massive Trajectory Data Based on MapReduce”. In CLOUDDB, 2009.

  27. Multiresolution Spatio-temporal Index. [Figure: three index levels. Yearly indexes: 2012, 2013, …; monthly indexes: jan, feb, …, dec; daily indexes: 1, 2, …, 366 (2012), 1, 2, …, 365 (2013), 1, 2, …, 31 (one month).] A. Eldawy, et al. “SHAHED: A MapReduce-based System for Querying and Visualizing Spatio-temporal Satellite Data”. ICDE 2015.

  28. Indexes in HDFS. Columns: Index; # of Partitions; Global; Local; C (clustered); U (unclustered); Dynamic. Hadoop-GIS: User-defined; Uniform grid; -; ✓. R-tree building: # of machines; Z-curve; R-tree; ✓. SpatialHadoop: # of Blocks; R(+)-tree; R(+)-tree; ✓ ✓. ScalaGiST: # of machines; K-d tree; GiST; ✓ ✓. ESRI-Hadoop: # of machines; Quad Tree; Quad Tree; ✓. GeoSpark: User-defined; Grid; R & Quad-tree; ✓ ✓. MD-HBase: # of Blocks; K-d tree or Quad tree; -; ✓ ✓. GeoMesa: # of Blocks; GeoHash; GeoHash; ✓. Trajectory Index: # of Blocks; Quad-tree-based; -; Insertion; ✓. SHAHED: # of Blocks; Multires. temporal + Grid; Aggregate Quad-tree; Insertion; ✓.

  29. Query Processing. Applications: satellite imagery, GIS, microblogs, medical imagery, …; Language; Visualization: single level and multilevel images; Query Processing: basic queries, spatial join, and computational geometry; Indexing: Grid, R-tree, Quad tree, K-d tree, …

  30. Spatial Query Processing. Basic queries: e.g., range query and nearest neighbor queries. Spatial join queries: e.g., self join, binary join, multiway join, and kNN join. Computational geometry queries: e.g., polygon union, Voronoi diagram construction, convex hull, and skyline. Spatial data mining operations: e.g., K-Means clustering and DBSCAN. Raster operations: e.g., aggregation and image quality.

  31. Spatial Range query in MapReduce (full scan). Split the input file using the default HDFS partitioning. Each mapper scans records in the assigned split. Matching records are written to the output. No reduce phase is required. [Figure: input Split1 to Split4, one RangeQuery mapper per split, output.] S. Zhang, J. Han, Z. Liu, K. Wang, and S. Feng. “Spatial Queries Evaluation with MapReduce”. In GCC, 2009.
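A map-only range query is simple enough to simulate without Hadoop. The sketch below is plain Python rather than the MapReduce API: each inner list stands in for one HDFS split, `range_query_mapper` plays the role of a mapper, and the function names and the `(x1, y1, x2, y2)` rectangle layout are assumptions for illustration.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query_mapper(split, query):
    """Scan one split and emit the ids of records that overlap the query rectangle."""
    for rid, mbr in split:
        if overlaps(mbr, query):
            yield rid

def range_query(splits, query):
    """Run one mapper per split and concatenate the outputs; no reduce step is involved."""
    output = []
    for split in splits:
        output.extend(range_query_mapper(split, query))
    return output

splits = [
    [("a", (0, 0, 1, 1)), ("b", (5, 5, 6, 6))],
    [("c", (2, 2, 3, 3))],
]
matches = range_query(splits, query=(0.5, 0.5, 2.5, 2.5))
```

Every split is scanned regardless of where the query lies, which is the cost that the indexed variant avoids.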

  32. Range query over indexed data. 1. Filter: Select overlapping partitions in the global index. 2. Refine: Select matching records in each partition. 3. Duplicate avoidance: Remove duplicates if records are replicated in the index (e.g., R+-tree and Grid). Systems: SpatialHadoop, Hadoop-GIS, ScalaGiST, ESRI Tools, MD-HBase
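The three steps can be sketched as below. The slide does not say how duplicates are removed; this example assumes the reference-point technique, in which a replicated record is reported only by the partition that contains the lower-left corner of its intersection with the query. The partition layout (a list of `(cell, records)` pairs with disjoint cells) and all function names are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains_point(cell, x, y):
    """Half-open containment, so a point on a shared border belongs to exactly one cell."""
    return cell[0] <= x < cell[2] and cell[1] <= y < cell[3]

def indexed_range_query(partitions, query):
    results = []
    for cell, records in partitions:
        if not overlaps(cell, query):       # 1. Filter: skip partitions outside the query
            continue
        for rid, mbr in records:
            if not overlaps(mbr, query):    # 2. Refine: test each record in the partition
                continue
            # 3. Duplicate avoidance (assumed reference-point rule): report the record
            # only from the cell holding the lower-left corner of (mbr intersect query).
            ref_x = max(mbr[0], query[0])
            ref_y = max(mbr[1], query[1])
            if contains_point(cell, ref_x, ref_y):
                results.append(rid)
    return results

r = ("r", (1, 0.5, 3, 1.5))  # spans both cells, so it is stored in both partitions
partitions = [
    ((0, 0, 2, 2), [("s", (0.2, 0.2, 0.5, 0.5)), r]),
    ((2, 0, 4, 2), [r, ("t", (3.5, 1, 3.8, 1.2))]),
]
found = indexed_range_query(partitions, query=(0.8, 0, 3.6, 2))
```

The appeal of this rule is that each partition decides locally whether to report a record, so no extra pass is needed to collect and deduplicate results.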

  33. K-Nearest Neighbor (Full scan). Straightforward solution, no index required. 1. Scan the input. Calculate distance to each point. 2. Select top-k on each machine. 3. Combine all matches in one machine and select top-k. [Figure: input scanned by machines M1 to M4, each producing a local top-k that is merged into the final top-k.] [1] S. Zhang, et al. “Spatial Queries Evaluation with MapReduce”. In GCC, 2009. [2] A. Aji, et al. “Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce”. In VLDB, 2013.
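The local top-k followed by a global top-k can be sketched with Python's `heapq`. This is a single-process simulation in which each inner list stands in for the points scanned by one machine; `local_top_k` and `knn` are hypothetical names, and Euclidean distance is an assumption.

```python
import heapq
from math import hypot

def local_top_k(points, q, k):
    """Steps 1 and 2: compute the distance to every point and keep the k nearest."""
    return heapq.nsmallest(k, ((hypot(x - q[0], y - q[1]), (x, y)) for x, y in points))

def knn(splits, q, k):
    """Step 3: gather the local candidates in one place and select the overall top-k."""
    candidates = []
    for split in splits:
        candidates.extend(local_top_k(split, q, k))
    return [point for _, point in heapq.nsmallest(k, candidates)]

splits = [[(0, 0), (5, 5), (1, 1)], [(2, 2), (0, 1)]]
nearest = knn(splits, q=(0, 0), k=2)
```

Each machine forwards at most k candidates, so the final merge handles at most k times the number of machines, however large the input is.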
