
Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond



  1. Spatial Data Management in Apache Spark The GeoSpark Perspective and Beyond Jia Yu

  2. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  3. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  4. WHAT IS GEOSPARK • A spatial data management system on top of Apache Spark, since 2015 • Some statistics • Monthly: downloads > 4K, visits > 8K; overall: downloads > 40K, visits > 100K • Was listed as an Infrastructure Project on the official Apache Spark third-party project page • Users and contributors from Apple, Facebook, Uber, and numerous startup companies • Evaluation from a recent Very Large Data Bases (VLDB) 2018 research paper: “GeoSpark comes close to a complete spatial analytics system because of data types and queries supported and the control user has while writing applications. It also exhibits the best performance in most cases.” (How Good Are Modern Spatial Analytics Systems? PVLDB Vol. 11)

  5. GEOSPARK OVERVIEW Example Spatial SQL query: SELECT superhero.name FROM city, superhero WHERE ST_Contains(city.geom, superhero.geom) AND city.name = 'Gotham'; [Figure: layered architecture with these components: Spatial SQL API; Scala/Java RDD API; Query Optimizer; Spatial Query Processing Layer (range, distance, KNN, range join, distance join, query optimization) producing the query result; Spatial RDD / DataFrame layer (Spatial RDD of Point, Polygon, LineString, ...; geometrical operations library; global spatial RDD partitioner; spatial partitioning; spatial index).]

  6. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  7. HETEROGENEOUS GEOMETRIES How about this? It can be even more complex, e.g., country boundaries. SELECT ST_GeomFromWKT(TaxiTripRawTable.pickuppointString) FROM TaxiTripRawTable
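
To make the idea of one column holding different geometry types concrete, here is a minimal sketch of turning WKT strings into typed geometries. It is not GeoSpark's ST_GeomFromWKT implementation: the function name `parse_wkt` and the tuple representation are assumptions for illustration, and only POINT and single-ring POLYGON are handled.

```python
import re

def parse_wkt(text):
    """Parse a tiny subset of WKT: POINT and single-ring POLYGON only (illustrative)."""
    match = re.fullmatch(r"\s*(POINT|POLYGON)\s*\((.*)\)\s*", text, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError(f"unsupported WKT: {text!r}")
    kind, body = match.group(1).upper(), match.group(2).strip()
    if kind == "POINT":
        x, y = (float(v) for v in body.split())
        return ("Point", (x, y))
    # POLYGON body looks like "(x1 y1, x2 y2, ...)"; holes are not supported here.
    ring = body.strip("()")
    coords = [tuple(float(v) for v in pair.split()) for pair in ring.split(",")]
    return ("Polygon", coords)

# The same column can yield different geometry types row by row.
rows = ["POINT (-73.98 40.75)", "POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))"]
geometries = [parse_wkt(row) for row in rows]
```

A real system would also need LineString, multi-geometries, and polygons with holes, which is part of why heterogeneous geometries complicate storage and serialization.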

  8. CUSTOM SERIALIZER [Figure: Spark's default serialization of spatial objects to and from a byte array is marked with a question mark; GeoSpark's custom serializer converts Point, Polygon, LineString, ..., and spatial indexes to and from a byte array.]
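
The byte-array idea can be sketched as follows. This is only an illustration of a compact, type-tagged binary layout; the type codes and layout are assumptions made up for this sketch and are not GeoSpark's actual wire format.

```python
import struct

# Hypothetical type codes for this sketch; not GeoSpark's actual format.
POINT, POLYGON = 1, 2

def serialize(kind, coords):
    """Pack a geometry as: 1-byte type code, 4-byte vertex count, then x/y doubles."""
    flat = [value for xy in coords for value in xy]
    return struct.pack(f"<BI{len(flat)}d", kind, len(coords), *flat)

def deserialize(buf):
    """Inverse of serialize: read the 5-byte header, then the coordinate pairs."""
    kind, count = struct.unpack_from("<BI", buf, 0)
    flat = struct.unpack_from(f"<{2 * count}d", buf, 5)
    return kind, [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]

point_bytes = serialize(POINT, [(1.5, -2.0)])
square_bytes = serialize(POLYGON, [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)])
```

With this layout a point takes 21 bytes and a four-vertex polygon 69 bytes, and the type code lets one byte stream carry mixed geometry types.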

  9. SPATIAL PARTITIONING For range queries and join queries. [Figure: two data layouts, one labeled "Not scalable" and the other "Scalable and fast".]

  10. SPATIAL PARTITIONING • Uniform grids • KDB-Tree • Quad-Tree • R-Tree
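
To illustrate why tree-based partitioning tends to balance load better than a uniform grid on skewed data, here is a simplified sketch. The median-split function is a KD-style stand-in for a KDB-Tree, not GeoSpark's partitioner, and it splits the points directly rather than deriving region boundaries from a sample; all names and the sample data are assumptions.

```python
def kd_partitions(points, capacity, depth=0):
    """Recursively split at the median, alternating x and y, until leaves are small."""
    if len(points) <= capacity:
        return [points]
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partitions(ordered[:mid], capacity, depth + 1)
            + kd_partitions(ordered[mid:], capacity, depth + 1))

def grid_partitions(points, cells_per_side, extent):
    """Assign each point to a cell of a uniform grid over [0, extent] x [0, extent]."""
    size = extent / cells_per_side
    cells = {}
    for x, y in points:
        key = (min(int(x // size), cells_per_side - 1),
               min(int(y // size), cells_per_side - 1))
        cells.setdefault(key, []).append((x, y))
    return list(cells.values())

# Skewed data: 90 points clustered near the origin, 10 spread along the diagonal.
cluster = [(i % 10 * 0.1, i // 10 * 0.1) for i in range(90)]
spread = [(10.0 * i + 5, 10.0 * i + 5) for i in range(10)]
sample = cluster + spread

kd_sizes = sorted(len(p) for p in kd_partitions(sample, capacity=25))
grid_sizes = sorted(len(p) for p in grid_partitions(sample, cells_per_side=2, extent=100.0))
print(kd_sizes)    # median splits give four equal partitions
print(grid_sizes)  # the grid puts almost everything into one cell
```

On this sample the median splits produce four partitions of 25 points each, while the 2 x 2 grid leaves 95 of the 100 points in a single cell.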

  11. SPATIAL INDEXING [Figure: the master holds the global spatial partitioning grid file; each worker holds a local R-Tree or Quad-Tree index.]

  12. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  13. SPATIAL RANGE QUERY SELECT * FROM TaxiTripTable WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint)
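
A range query such as the one above is commonly evaluated with a filter-and-refine approach: a cheap bounding-box test first, then an exact geometric test on the survivors. The sketch below illustrates that pattern with a linear scan; it is not GeoSpark's code, and the function names are assumptions.

```python
def mbr(polygon):
    """Minimum bounding rectangle of a polygon given as a list of (x, y) vertices."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(point, polygon):
    """Ray-casting test: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def range_query(points, polygon):
    min_x, min_y, max_x, max_y = mbr(polygon)
    # Filter: the cheap bounding-box test discards most candidates.
    candidates = [p for p in points
                  if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]
    # Refine: the exact, more expensive test runs only on the survivors.
    return [p for p in candidates if point_in_polygon(p, polygon)]

triangle = [(0, 0), (4, 0), (0, 4)]
hits = range_query([(1, 1), (3, 3), (5, 5)], triangle)
```

Here (5, 5) is rejected by the filter step, (3, 3) passes the filter but fails the refine step, and only (1, 1) is returned.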

  14. SPATIAL KNN QUERY SELECT ST_Neighbors(MyLocation, Restaurants.Locations, 20) FROM Restaurants [Figure: KNN query data flow. Stage 1: query the local index of each partition of the cached, indexed SRDD (narrow dependency), producing an intermediate SRDD. Stage 2: small shuffle (wide dependency), then sort and take the first K, written to the output file. Labels also include: recycle.]
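
The two-stage flow can be sketched as a local top-K per partition followed by a global top-K over the few surviving candidates. This is an illustration under simplifying assumptions: partitions are plain Python lists, and a linear scan stands in for the local index query.

```python
import heapq
import math

def knn(partitions, query, k):
    """Two-stage KNN: local top-k per partition, then a global top-k."""
    def dist(p):
        return math.dist(p, query)
    # Stage 1 (per partition, no shuffle): keep only the k closest local points.
    local = [heapq.nsmallest(k, part, key=dist) for part in partitions]
    # Stage 2 (small shuffle): at most k points per partition are merged and sorted.
    merged = [p for part in local for p in part]
    return heapq.nsmallest(k, merged, key=dist)

partitions = [[(0, 0), (5, 5)], [(1, 1), (9, 9)], [(2, 2)]]
nearest = knn(partitions, query=(0, 0), k=2)
```

Because each partition forwards at most K points, the shuffle stays small regardless of the input size.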

  15. SPATIAL JOIN QUERY SELECT * FROM TaxiZones, TaxiTripTable WHERE ST_Contains(TaxiZones.bound, TaxiTripTable.pickuppoint)

  16. SPATIAL JOIN QUERY • Broadcast join algorithm: without spatial partitioning • GSJoin algorithm: with spatial partitioning on the two input SRDDs [Figure: join query data flows. GSJoin: SRDD A (repartitioned) and indexed SRDD B (repartitioned, cached) are zipped partition by partition by ID, the local index is queried, and the result SRDD is produced. Labels also include: recycle, shuffle, small shuffle, narrow dependency, wide dependency, intermediate SRDD, Stage 1.]
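
The difference between the two strategies can be sketched with rectangles and points. This is a simplified illustration, not the GSJoin implementation: a uniform grid stands in for the spatial partitioner, nested loops stand in for the local index, and all names are assumptions.

```python
from collections import defaultdict

def contains(rect, point):
    """True if the point lies in the rectangle (min_x, min_y, max_x, max_y)."""
    return rect[0] <= point[0] <= rect[2] and rect[1] <= point[1] <= rect[3]

def broadcast_join(rects, points):
    """Every point is checked against every (broadcast) rectangle."""
    return {(r, p) for p in points for r in rects if contains(r, p)}

def partitioned_join(rects, points, cell):
    """Grid-partition both inputs, then join only partitions with the same cell ID."""
    point_parts, rect_parts = defaultdict(list), defaultdict(list)
    for p in points:
        point_parts[(int(p[0] // cell), int(p[1] // cell))].append(p)
    for r in rects:
        # A rectangle is copied into every cell it overlaps.
        for cx in range(int(r[0] // cell), int(r[2] // cell) + 1):
            for cy in range(int(r[1] // cell), int(r[3] // cell) + 1):
                rect_parts[(cx, cy)].append(r)
    result = set()
    # "Zip partitions by ID": only co-located partitions are compared.
    for cell_id, local_points in point_parts.items():
        for r in rect_parts.get(cell_id, []):
            result.update((r, p) for p in local_points if contains(r, p))
    return result

rects = [(0, 0, 3, 3), (2, 2, 12, 12)]
points = [(1, 1), (2.5, 2.5), (11, 11), (20, 20)]
joined = partitioned_join(rects, points, cell=5)
```

Both strategies return the same pairs; the partitioned version compares each point only with rectangles that overlap its own cell.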

  17. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  18. QUERY OPTIMIZER (V1.2.0) • Heuristics-Based Optimization • Cost-Based Optimization [Figure: the Spark SQL planning pipeline extended by GeoSpark. SQL or DataFrames -> unresolved logical plan -> analysis (catalog) -> logical plan -> logical optimization -> optimized logical plan -> physical planning -> physical plans -> cost models -> selected physical plan -> code generation. GeoSpark components: expressions, heuristic rules, cost-based strategies, statistics.]

  19. HEURISTICS BASED OPTIMIZATION • Predicate pushdown SELECT * FROM TaxiStopStations, TaxiTripTable WHERE ST_Contains(TaxiStopStations.bound, TaxiTripTable.pickuppoint) AND ST_Contains(Manhattan, TaxiStopStations.bound) [Figure: two query plans over pickup points and taxi stops. (a) No predicate pushdown: range join (broadcast or GSJoin) first, then the range filter on Manhattan, then the result. (b) With predicate pushdown: the range filter on Manhattan is applied to both inputs first, then the range join (broadcast or GSJoin), then the result.]
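
The effect of pushing the range filter below the join can be sketched by counting how many pairs the join has to test. This is an illustration only: rectangles stand in for taxi stop bounds and for Manhattan, a nested loop stands in for the join, and all names are assumptions.

```python
def contains(rect, point):
    """True if the point lies in the rectangle (min_x, min_y, max_x, max_y)."""
    return rect[0] <= point[0] <= rect[2] and rect[1] <= point[1] <= rect[3]

def within(outer, inner):
    """True if rectangle `inner` lies entirely inside rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def range_join(stops, pickups):
    """Nested-loop join; also reports how many pairs were tested."""
    pairs = [(s, p) for s in stops for p in pickups if contains(s, p)]
    return pairs, len(stops) * len(pickups)

def join_then_filter(stops, pickups, region):
    """Plan (a): join everything, then apply the region filter."""
    pairs, checks = range_join(stops, pickups)
    return [(s, p) for s, p in pairs if within(region, s)], checks

def filter_then_join(stops, pickups, region):
    """Plan (b): filter both inputs by the region, then join what is left."""
    kept_stops = [s for s in stops if within(region, s)]
    # A pickup inside a kept stop must also lie inside the region,
    # so pre-filtering the pickups cannot drop any result.
    kept_pickups = [p for p in pickups if contains(region, p)]
    return range_join(kept_stops, kept_pickups)

region = (0, 0, 10, 10)
stops = [(1, 1, 2, 2), (20, 20, 21, 21)]
pickups = [(1.5, 1.5), (20.5, 20.5), (5, 5)]
plan_a = join_then_filter(stops, pickups, region)
plan_b = filter_then_join(stops, pickups, region)
```

Both plans return the same single pair, but plan (a) tests 6 pairs in the join while plan (b) tests 2.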

  20. HEURISTICS BASED OPTIMIZATION • Predicate merging SELECT * FROM TaxiTripTable WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint) AND ST_Contains(Queens, TaxiTripTable.pickuppoint) (a) AND, take the intersection SELECT * FROM TaxiTripTable WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint) OR ST_Contains(Queens, TaxiTripTable.pickuppoint) (b) OR, take the union
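
Predicate merging can be sketched with axis-aligned rectangles standing in for the Manhattan and Queens polygons (a simplification, since the intersection of two rectangles is again a rectangle). The function names are assumptions; the point is that two range predicates on the same column collapse into a single pass over the data.

```python
def contains(rect, point):
    """True if the point lies in the rectangle (min_x, min_y, max_x, max_y)."""
    return rect[0] <= point[0] <= rect[2] and rect[1] <= point[1] <= rect[3]

def intersect(a, b):
    """Intersection of two rectangles, or None if they do not overlap."""
    min_x, min_y = max(a[0], b[0]), max(a[1], b[1])
    max_x, max_y = min(a[2], b[2]), min(a[3], b[3])
    if min_x <= max_x and min_y <= max_y:
        return (min_x, min_y, max_x, max_y)
    return None

def query_and(points, a, b):
    """AND: merge both windows into their intersection and scan once."""
    window = intersect(a, b)
    if window is None:
        return []  # disjoint windows: the answer is empty without any scan
    return [p for p in points if contains(window, p)]

def query_or(points, a, b):
    """OR: test membership in the union of both windows in a single scan."""
    return [p for p in points if contains(a, p) or contains(b, p)]

window_a, window_b = (0, 0, 4, 4), (2, 2, 6, 6)
points = [(1, 1), (3, 3), (5, 5), (7, 7)]
both = query_and(points, window_a, window_b)
either = query_or(points, window_a, window_b)
```

For real polygons the merged predicate would be the polygon intersection or union rather than a rectangle, but the single-scan structure is the same.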

  21. HEURISTICS BASED OPTIMIZATION • Intersection query rewrite • Original (cross join, slow): SELECT ST_Intersection(Lions.habitat, Zebras.habitat) FROM Lions, Zebras • Rewritten (optimized GeoSpark inner join, fast): SELECT ST_Intersection(Lions.habitat, Zebras.habitat) FROM Lions, Zebras WHERE ST_Intersects(Lions.habitat, Zebras.habitat);

  22. COST BASED OPTIMIZATION • Cost: based on GeoSpark statistics (minimum bounding rectangle (MBR), count) • Index scan selection: index scan vs. DataFrame scan, based on query selectivity • Spatial join algorithm selection: partition-wise GeoSpark join vs. broadcast join
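
The two decisions above can be sketched from the two statistics the slide names, MBR and count. This is a hypothetical cost model for illustration only: the uniform-data assumption, the function names, and both thresholds (`threshold`, `broadcast_limit`) are invented here and are not GeoSpark's actual rules.

```python
def overlap_area(a, b):
    """Area shared by two rectangles (min_x, min_y, max_x, max_y)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0

def estimate_selectivity(query_window, data_mbr):
    """Fraction of the data MBR covered by the query, assuming uniform data."""
    data_area = (data_mbr[2] - data_mbr[0]) * (data_mbr[3] - data_mbr[1])
    if data_area <= 0:
        return 1.0
    return overlap_area(query_window, data_mbr) / data_area

def choose_scan(query_window, data_mbr, threshold=0.1):
    """Selective queries benefit from the index; broad ones scan everything anyway."""
    selective = estimate_selectivity(query_window, data_mbr) < threshold
    return "index scan" if selective else "DataFrame scan"

def choose_join(left_count, right_count, broadcast_limit=100_000):
    """Broadcast the smaller side when it is small enough; otherwise partition both."""
    if min(left_count, right_count) <= broadcast_limit:
        return "broadcast join"
    return "partition-wise join"

narrow = choose_scan((0, 0, 1, 1), (0, 0, 10, 10))
broad = choose_scan((0, 0, 9, 9), (0, 0, 10, 10))
```

A production optimizer would typically use richer statistics, such as histograms, because real spatial data is rarely uniform.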

  23. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  24. Feature | MapD | Kinetica | GeoSpark
      Distributed | Yes, late 2017 | Yes | Yes
      Spatial SQL | Yes | Yes, limited | Yes
      Compact in-mem geometry, index | No | No | Yes
      Distributed spatial index | No, nested loop | No | Yes, dist. Quad-Tree, R-Tree
      Distributed spatial data partitioning | No, still hash or round-robin | No | Yes, 4 spatial partition methods
      Opt. distributed spatial join | No | No | Yes
      Spatial query optimizer | No | No | Yes, HBO, CBO
      Fault tolerance | No, fail right away | Yes | Yes, RDD lineage
      SQL CodeGen | Yes | No | Yes
      Streaming | Yes | Yes | Yes
      Storage system | Yes | Yes | No, but + MapD, + Kinetica

  25. REFERENCE • MapD RoadMap: https://github.com/mapd/mapd-core/blob/master/ROADMAP.md • Kinetica: https://www.kinetica.com/product/faq/

  26. THIS TALK GeoSpark overview Spatial RDD / DataFrame layer Spatial query processing layer Query optimizer GPU-based spatial database Experiments

  27. JOIN QUERY • Setup: 4 machines • 1.3 billion points join 171 thousand polygons • 72.7 million line strings join 171 thousand polygons • 263 million polygons join 171 thousand polygons [Figure: two charts, each with results for Point, Polygon, and LineString.]

  28. CONCLUSION • GeoSpark is the fastest of the compared systems • For join queries, GeoSpark uses the least memory because its serializer lets Spark serialize and deserialize data quickly, so little intermediate data stays in memory

  29. QUESTIONS?

  30. THE IMPACT OF INDEX Point data Polygon data

  31. CONCLUSION 2 • A spatial index only helps when pruning complex shapes, because of the filter-and-refine model

  32. THE IMPACT OF SPATIAL PARTITIONING Point Polygon LineString

  33. CONCLUSION 3 • KDB-Tree partitioning is the most load-balanced • Quad-Tree comes next • R-Tree is the worst of the tree-based methods but still better than uniform grids

  34. POINT RANGE

  35. POLYGON RANGE

  36. LINE STRING RANGE
