Big Spatial Data Management on Hadoop and Beyond, by Ahmed Eldawy

  1. Big Spatial Data Management on Hadoop and Beyond. Ahmed Eldawy, Computer Science and Engineering

  2. Claudius Ptolemy (AD 90 – AD 168)

  3. Al Idrisi (1099 – 1165)

  4. Cholera cases in the London epidemic of 1854

  5. "I have BIG data. I need HELP..!!" "Cool computer technology..!! Can I use it in my application?" "My pleasure. Here it is." "Oh..!! But, it is not made for me. Can’t make use of it as is."

  6. 1969: "Kindly let me get the technology you have." "Kindly let me understand your needs."

  7. "HELP..!! I have BIG data. Your technology is not helping me." "mmm… Let me check with my good friends there." "Cool Database technology..!! Can I use it in my application?" "My pleasure. Here it is." "Oh..!! But, it is not made for me. Can’t make use of it as is."

  8. "Kindly let me get the technology you have." "Kindly let me understand your needs."

  9. "HELP..!! Again, I have BIG data. Your technology is not helping me." "Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there." "Cool Big Data technology..!! Can I use it in my application?" "My pleasure. Here it is." "Oh..!! But, it’s not made for me. Can’t make use of it as is."

  10. "Kindly let me get the technology you have." "Kindly let me understand your needs."

  11. Big Spatial Data Management

  12. Tons of spatial data out there: geotagged pictures, geotagged microblogs, medical data, sensor networks, smart phones, satellite images, traffic data, VGI (volunteered geographic information).

  13. Spatial Data & Hadoop ➔ SpatialHadoop

      Hadoop (takes 193 seconds):
          points = LOAD 'points' AS (id:int, x:int, y:int);
          result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;

      SpatialHadoop (finishes in 2 seconds):
          points = LOAD 'points' AS (id:int, location:point);
          result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));

  14. SpatialHadoop components: Spatial Operations, Visualization, Spatial Language, Spatial Indexes. 80,000 downloads in one year. Conducted more than seven keynotes, tutorials, and invited talks. Used in industry, academia, and student projects. Collaboration: University of Genova. >500GB public datasets for benchmarking and testing. Incubated by Eclipse Foundation and renamed to GeoJinni.

  15. The Built-in Approach of SpatialHadoop. Three approaches are compared: the On-top Approach, the From Scratch Approach, and the Built-in Approach (SpatialHadoop). Layers of the Hadoop stack shown in the diagram: User Programs; Language APIs (Pig Latin, Hadoop Java APIs); Job Monitoring and Scheduling; MapReduce Runtime; Storage (HDFS). In the On-top Approach, the Spatial Modules are part of the user program. The Built-in Approach adds to the stack: + Spatial Language, + Spatial Operators, + Early Pruning, + Spatial Indexing.

  16. Agenda: The ecosystem of SpatialHadoop; Motivation; Internal system design; Applications; Related work; Experiments; Open research problems.

  17. SpatialHadoop Architecture. Applications: SHAHED [ICDE’15], MNTG [SSTD’13, ICDE’14], TAREEG [SIGMOD’14, SIGSPATIAL’14]. Language: Pigeon [ICDE’14]. Visualization [VLDB’15, ICDE’16]. Operations: Basic operations [VLDB’13, ICDE’15], CG_Hadoop [SIGSPATIAL’13], ST-Hadoop. MapReduce: Spatial File Splitter, Spatial Record Reader. Indexing [VLDB’15]: Grid, R-tree, R+-tree, Quad tree.

  18. Indexing (architecture diagram of slide 17, repeated)

  19. Data Loading in Hadoop: Blindly chops down a big file into 128MB chunks. Values of records are not considered. Relevant records are typically assigned to two different blocks. HDFS is too restrictive where files cannot be modified. (Figure: an Input File cut into 128MB blocks that are distributed over the Data Nodes.)
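
The effect of blind chunking can be illustrated with a small sketch. It is not HDFS code: blocks are cut by record count rather than by 128MB of bytes, and the function name `chop_into_blocks` and the sample records are hypothetical.

```python
def chop_into_blocks(records, block_capacity):
    """Cut the input into fixed-size blocks by arrival order only.
    Record values (here, the coordinates) are never inspected."""
    return [records[i:i + block_capacity]
            for i in range(0, len(records), block_capacity)]

# Hypothetical records (id, x, y). Records 1 and 4 are spatial neighbours.
records = [(1, 0.0, 0.0), (2, 9.0, 9.0), (3, 5.0, 1.0), (4, 0.1, 0.1)]
blocks = chop_into_blocks(records, block_capacity=2)

def block_of(record_id):
    """Index of the block that holds the given record id."""
    return next(i for i, block in enumerate(blocks)
                if any(r[0] == record_id for r in block))

print(block_of(1), block_of(4))  # prints "0 1": the neighbours land in different blocks
```

Any query touching the neighbourhood of record 1 therefore has to read both blocks, which is the problem the spatial indexes on the following slides address.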

  20. Two-layer Index Layout: Global Indexing over the Data Nodes; Locally Indexed HDFS Blocks; Global Index

  21. Uniform Grid: Works only for uniformly distributed data
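
A minimal sketch of why a uniform grid struggles with skew, assuming points are assigned to cells by simple arithmetic on their coordinates. The function `grid_cell` and the sample data are illustrative and are not taken from SpatialHadoop.

```python
from collections import Counter

def grid_cell(x, y, xmin, ymin, xmax, ymax, cols, rows):
    """Return the (column, row) of the uniform grid cell containing (x, y)."""
    col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
    row = min(int((y - ymin) / (ymax - ymin) * rows), rows - 1)
    return col, row

# Hypothetical skewed data: nine points near the origin, one far away.
points = [(0.1 * i, 0.1 * i) for i in range(9)] + [(9.5, 9.5)]
loads = Counter(grid_cell(x, y, 0, 0, 10, 10, cols=2, rows=2) for x, y in points)
print(loads)  # nine points in cell (0, 0), one in cell (1, 1), two cells empty
```

With uniformly distributed points each cell would receive a similar load; with skewed data one partition ends up holding almost everything.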

  22. R-tree: Read a sample. Bulk load the sample into an R-tree with leaf node capacity $C = \frac{k \cdot B}{|R| (1 + \alpha)}$, where k: sample size, B: HDFS block capacity, |R|: input size, α: index overhead. Use the MBR (minimum bounding rectangle) of leaf nodes as partition boundaries.

  23. R-tree (continued): Partition the data.

  24. R-tree (continued): Optional: build R-tree local indexes.
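
The sample-based partitioning can be sketched as follows. The leaf capacity follows the slide's definitions of k, B, |R| and α; the packing uses an STR-style sort-and-slice, which is an assumption made here for illustration, since the slides do not say which bulk-loading method is used. All names and numbers are hypothetical.

```python
import math

def leaf_capacity(k, block_capacity, input_size, alpha):
    """C = k * B / (|R| * (1 + alpha)), with at least one sample point per leaf."""
    return max(1, math.floor(k * block_capacity / (input_size * (1 + alpha))))

def str_leaf_mbrs(sample, capacity):
    """Pack sample points into leaves of `capacity` points (STR-style)
    and return each leaf's MBR as (xmin, ymin, xmax, ymax)."""
    n_leaves = math.ceil(len(sample) / capacity)
    n_strips = math.ceil(math.sqrt(n_leaves))
    strip_size = math.ceil(len(sample) / n_strips)
    by_x = sorted(sample)
    mbrs = []
    for i in range(0, len(by_x), strip_size):
        # Each vertical strip is sorted by y and cut into leaves.
        strip = sorted(by_x[i:i + strip_size], key=lambda p: p[1])
        for j in range(0, len(strip), capacity):
            leaf = strip[j:j + capacity]
            xs, ys = [p[0] for p in leaf], [p[1] for p in leaf]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs

# Hypothetical numbers: 16 sample points, input of 1000 units, blocks of
# 250 units, no index overhead, so the input needs 4 partitions.
sample = [(x, y) for x in range(4) for y in range(4)]
C = leaf_capacity(k=len(sample), block_capacity=250, input_size=1000, alpha=0.0)
partitions = str_leaf_mbrs(sample, C)
print(C, len(partitions))  # prints "4 4"
```

Because the boundaries come from a sample, a real partitioner would also have to decide where to put records that fall outside every leaf MBR; that step is left out of this sketch.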

  25. R-tree-based Index of a 400 GB road network

  26. Non-indexed Heap File

  27. MapReduce Layer (architecture diagram of slide 17, repeated with updated venues: ST-Hadoop [TODS]; CG_Hadoop [SIGSPATIAL’13, TSAS]. Legend: under review; demo paper.)

  28. Map plan. Hadoop: Input Heap File ➔ File Splitter ➔ splits ➔ Record Reader ➔ (k,v) pairs ➔ Map task. SpatialHadoop: Indexed Input File(s) ➔ Spatial File Splitter (with a Filter Function, which determines the number of splits) ➔ splits ➔ Spatial Record Reader ➔ (k,v) pairs ➔ Map task.

  29. Operations (architecture diagram of slide 17, repeated)

  30. Operations Layer. Basic operations: e.g., range query and kNN. Spatial join operations. Computational geometry operations: e.g., Polygon Union, Voronoi Diagram, Delaunay Triangulation, and Convex Hull. User-defined operations: e.g., kNN join.

  31. Range Query: Use the global index to prune disjoint partitions. Use local indexes to find matching records.
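
The two steps of the range query can be sketched like this. The global index is modelled as a plain list of (MBR, points) pairs and the local index lookup is replaced by a linear scan; both simplifications, and all names, are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(global_index, query):
    """global_index: list of (partition_mbr, points).
    Returns the matching points and the number of partitions actually read."""
    results, partitions_read = [], 0
    for mbr, points in global_index:
        if not overlaps(mbr, query):
            continue  # pruned by the global index: this partition is never read
        partitions_read += 1
        # Stand-in for a local index lookup: scan the partition's points.
        results += [p for p in points
                    if query[0] <= p[0] <= query[2] and query[1] <= p[1] <= query[3]]
    return results, partitions_read

index = [((0, 0, 5, 5), [(1, 1), (4, 4)]),
         ((5, 0, 10, 5), [(6, 1), (9, 4)]),
         ((0, 5, 10, 10), [(2, 8)])]
print(range_query(index, (0, 0, 2, 2)))  # prints "([(1, 1)], 1)"
```

Only one of the three partitions is read for this query; the other two are discarded from their MBRs alone.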

  32. KNN over Indexed Data: The first iteration runs as before and the result is tested for correctness (✗ answer is incorrect). The second iteration processes other blocks that might contain an answer (✓ answer is correct). Example shown with k=3.
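
A sketch of the two-iteration idea, under the assumption that the correctness test checks whether any other partition lies closer to the query point than the current k-th neighbour. The function `knn`, the index layout, and the sample data are hypothetical; the query point is assumed to fall inside some partition.

```python
import math

def knn(global_index, q, k):
    """global_index: list of (mbr, points), mbr = (xmin, ymin, xmax, ymax).
    Returns the k nearest points and how many extra partitions were processed."""
    def dist(p):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def min_dist_to_mbr(m):
        dx = max(m[0] - q[0], 0, q[0] - m[2])
        dy = max(m[1] - q[1], 0, q[1] - m[3])
        return math.hypot(dx, dy)

    # Iteration 1: answer the query inside the partition that contains q.
    home = next(i for i, (m, _) in enumerate(global_index) if min_dist_to_mbr(m) == 0)
    answer = sorted(global_index[home][1], key=dist)[:k]
    radius = dist(answer[-1]) if len(answer) == k else math.inf

    # Correctness test: a partition closer than the k-th neighbour may hold a better answer.
    others = [i for i, (m, _) in enumerate(global_index)
              if i != home and min_dist_to_mbr(m) < radius]

    # Iteration 2: only the partitions that failed the test are processed.
    for i in others:
        answer = sorted(answer + global_index[i][1], key=dist)[:k]
    return answer, len(others)

index = [((0, 0, 5, 5), [(1, 1), (2, 2), (4.9, 1)]),
         ((5, 0, 10, 5), [(5.1, 1), (9, 4)])]
print(knn(index, (4.5, 1), k=2))  # prints "([(4.9, 1), (5.1, 1)], 1)"
```

In this example the first iteration alone would return (4.9, 1) and (2, 2); the second iteration replaces (2, 2) with the closer point (5.1, 1) from the neighbouring partition.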

  33. Spatial Join: Partition–Join; Join Directly

  34. Spatial Join: Partition–Join; Join Directly. Total of 36 overlapping pairs; only 16 overlapping pairs.
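
The two counts can be reproduced with a small experiment. The grid sizes below are an assumption chosen because they happen to give the slide's numbers; the slides do not state how the two inputs are partitioned. Joining a 3×3 partitioning directly against a 4×4 one yields 36 overlapping partition pairs, while repartitioning one input to match the other leaves 16.

```python
from fractions import Fraction

def grid(n):
    """Cells of an n x n uniform grid over the unit square, as (xmin, ymin, xmax, ymax)."""
    step = Fraction(1, n)
    return [(i * step, j * step, (i + 1) * step, (j + 1) * step)
            for i in range(n) for j in range(n)]

def interiors_overlap(a, b):
    """True if the two rectangles share more than a boundary."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def overlapping_pairs(parts_a, parts_b):
    """Number of partition pairs a join would have to process."""
    return sum(interiors_overlap(a, b) for a in parts_a for b in parts_b)

print(overlapping_pairs(grid(3), grid(4)))  # prints 36: mismatched partitions joined directly
print(overlapping_pairs(grid(4), grid(4)))  # prints 16: one input repartitioned to match
```

Fewer overlapping pairs means fewer partition-to-partition joins, at the cost of repartitioning one input first.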

  35. CG_Hadoop. Operations: Polygon Union, Skyline, Convex Hull, Farthest/closest pair, Delaunay Triangulation, Voronoi Diagram. Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x. Reference: A. Eldawy, Y. Li, M. F. Mokbel, R. Janardan. “CG_Hadoop: Computational Geometry in MapReduce”, ACM SIGSPATIAL’13.

  36. Convex Hull: Find the minimal convex polygon that contains all points. (Figure: Input and Output.)

  37. Convex Hull in CG_Hadoop (Hadoop and SpatialHadoop). Steps: Partition; Pruning; Local hull; Global hull.
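
The local-hull and global-hull steps can be sketched as below. Andrew's monotone chain is used as the single-machine hull algorithm, which is a choice made for this illustration rather than something the slides specify, and the pruning step is left out. Input partitions are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))

def distributed_hull(partitions):
    """Local hull per partition, then one global hull over the local hull vertices."""
    local_hulls = [convex_hull(p) for p in partitions]
    return convex_hull([v for hull in local_hulls for v in hull])

partitions = [[(0, 0), (1, 1), (2, 0), (1, 3)],
              [(5, 0), (4, 1), (5, 5), (3, 2)]]
print(distributed_hull(partitions))  # prints "[(0, 0), (5, 0), (5, 5), (1, 3)]"
```

The global step only sees the vertices of the local hulls, which is what keeps the final single-machine computation small.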

  38. Visualization (architecture diagram of slide 17, repeated)

  39. Visualization in HadoopViz: The goal of HadoopViz is not to propose new visualization techniques; instead, its goal is to scale out existing techniques. Examples shown: Satellite Data, Road Network, Heat Map, Vector Map, Admin Boundaries, Scatter Plot.

  40. Heat Map: From 2009 to 2014, month by month. 72 frames × 14 billion points per frame; total = 1 trillion points. Created in 3 hours on 10 nodes instead of 60 hours.
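
The slide's totals can be checked with a little arithmetic, assuming one frame per month for the six years 2009 through 2014:

```latex
\begin{align*}
\text{frames} &= 6 \text{ years} \times 12 \text{ months} = 72 \\
\text{points} &= 72 \times 14 \times 10^{9} = 1.008 \times 10^{12} \approx 1 \text{ trillion} \\
\text{speedup} &= \frac{60 \text{ hours}}{3 \text{ hours}} = 20\times
\end{align*}
```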

  41. Abstract Visualization: The input is partitioned. Each partition runs 1. smooth, 2. create canvas, 3. plot. The partial canvases are combined by 4. merge, and 5. write produces the output image.

  42. Example: Satellite Data Visualization. 1. Smooth: recover holes. 2. Create Canvas: initialize a 2D matrix with zeros. 3. Plot: update the matrix. 4. Merge: matrix addition. 5. Write: generate the image. (Figure: an all-zero canvas; a canvas with plotted values 33, 8, and 26 added to an all-zero canvas.)
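
The create canvas, plot, and merge steps can be sketched with plain nested lists. The smooth and write steps are omitted, and the canvas size, pixel positions, and function names are assumptions made for illustration; only the values 33, 8, and 26 come from the slide.

```python
def create_canvas(width, height):
    """Create canvas: a 2D matrix initialised with zeros."""
    return [[0] * width for _ in range(height)]

def plot(canvas, records):
    """Plot: update the matrix with each (column, row, value) record."""
    for col, row, value in records:
        canvas[row][col] += value
    return canvas

def merge(a, b):
    """Merge: element-wise matrix addition of two partial canvases."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Two hypothetical partitions, each plotted on its own canvas.
partition_1 = [(1, 1, 33), (4, 1, 8)]
partition_2 = [(2, 2, 26)]
partials = [plot(create_canvas(6, 3), p) for p in (partition_1, partition_2)]
image = merge(*partials)
for row in image:
    print(row)
```

Because addition is associative and commutative, partial canvases can be merged in any order, which is what lets the merge step run in parallel.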
