

  1. Indexing the Pickup and Drop-off Locations of NYC Taxi Trips in PostgreSQL – Lessons from the Road. Jia Yu and Mohamed Sarwat, Arizona State University

  2. A Little Story…
  • August 1, 2015: Over 1 billion taxi trip records from 2009 to 2015 were released by the New York City Taxi & Limousine Commission
  • Since then: new taxi trip records keep being published on the Internet
  • As of today: millions of new records have been added to the dataset
  • Example record: PickupTime 2009-01-01 08:01:01, DropoffTime 2009-01-01 08:20:37, TripDistance 2.2, PickupLocation (40.7577, -73.9851), DropoffLocation (40.7497, -73.9882), PaymentType Credit Card, FareAmount 15.5, TipAmount 3.5
  (Photo credit: NYC TLC website)
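  A minimal PostGIS table sketch for records with this schema; the table name nyctaxi, column names, and column types are illustrative assumptions, not the authors' actual schema:

      CREATE EXTENSION IF NOT EXISTS postgis;

      -- Hypothetical layout for the NYC taxi trip records shown above.
      CREATE TABLE nyctaxi (
          pickuptime       timestamp,
          dropofftime      timestamp,
          tripdistance     numeric,
          pickuplocation   geometry(Point, 4326),  -- stored lon/lat, e.g. POINT(-73.9851 40.7577)
          dropofflocation  geometry(Point, 4326),
          paymenttype      text,
          fareamount       numeric,
          tipamount        numeric
      );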

  3. A Little Story… (cont.)
  • People really want to run spatial queries on this 175 GB dataset in PostGIS
  • People really need a spatial index to speed up those queries
  • Which spatial index can handle this? 1 billion records, 175 GB, millions of new records, continuously published
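  The kind of spatial query in question, as a sketch against the hypothetical nyctaxi table above; the query window coordinates are made up for illustration:

      -- Find all trips picked up inside a small window around Times Square.
      -- ST_MakeEnvelope(xmin, ymin, xmax, ymax, srid) builds the query window.
      SELECT *
      FROM nyctaxi N
      WHERE ST_Within(
              N.pickuplocation,
              ST_MakeEnvelope(-73.99, 40.75, -73.98, 40.76, 4326)
            );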

  4. Compared Approaches: GiST
  • Generalized Search Tree (GiST-Spatial, similar to an R-Tree)
  • Index structure: a tree (root, non-leaf nodes, leaf entries pointing to tuples)
  • Index entry (tree node): Minimum Bounding Rectangle (MBR) plus tuple pointers
  • Index search: top-down; prunes quickly by checking the query window against each node's MBR
  • Index maintenance: search the tree, then split nodes (if full) and merge nodes (if too empty)
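  Building such an index in PostGIS uses the standard GiST access method; a minimal sketch on the hypothetical table from above:

      -- R-Tree-style spatial index over the pickup points.
      CREATE INDEX nyctaxi_pickup_gist
          ON nyctaxi
          USING GIST (pickuplocation);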

  5. Compared Approaches: GiST
  • Summary of GiST
    • Fast index search
    • Large storage overhead: 20% or more additional storage
    • Slow maintenance: splitting and merging tree nodes
  • Measured on the 175 GB NYC data: index size 84 GB, initialization time 28 hours, insertion of 0.1% new data 6 hours

  6. Compared Approaches: BRIN-Spatial
  • Block Range Index (BRIN-Spatial): PostgreSQL 9.5, PostGIS 2.3
  • Indexes heap file pages (block ranges of the data table), not individual tuples
  • Example query: SELECT * FROM NYCtaxi N WHERE ST_Within(N.pickuppoint, QueryWindow)
  • Index search:
    1. Serial scan of the index, checking the query window against each range's MBR
    2. Filter out false-positive pages
  • Index maintenance:
    • Insertion: update the MBR of the affected range
    • Deletion: no update
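  Creating the BRIN index, as a sketch on the hypothetical table from above; PostGIS 2.3 ships a BRIN operator class for geometry, and pages_per_range = 128 matches the setting used later in the experiments:

      -- Block Range Index over the pickup points; each entry summarizes 128 heap pages.
      CREATE INDEX nyctaxi_pickup_brin
          ON nyctaxi
          USING BRIN (pickuplocation)
          WITH (pages_per_range = 128);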

  7. Compared Approaches: BRIN-Spatial
  • Summary of BRIN-Spatial
    • Indexes heap file pages of the data table
    • Very small index
    • Fast maintenance
    • Not good at queries

  8. Compared Approaches: Hippo-Spatial
  • Hippo-Spatial: PVLDB 2016
  • Indexes heap file pages
  • Index entry: a dynamic page range plus a partial histogram (bit flags over the buckets of a grid histogram on X and Y)
  • Index search:
    1. Serial scan of the index entries, finding buckets shared by the query window and each entry's partial histogram
    2. Filter out false-positive pages
  (Figure: page ranges such as 1-10, 26-30, and 11-25, each summarized by a partial histogram over a 4x4 grid on X and Y; overlapping pages return results, the rest are false positives that get filtered out.)

  9. Compared Approaches: Hippo-Spatial
  • Index maintenance
    • Data insertion: eager update of the partial histogram (traverse the index entries, find the entry whose page range is out of date, and re-summarize that range)
    • Data deletion: lazy update of the partial histogram
  (Figure: an insertion into page 26 triggers re-summarization of the entry covering pages 26-30.)

  10. Compared Approaches: Hippo-Spatial
  • Summary of Hippo-Spatial
    • Indexes heap file pages
    • Still small
    • Fast maintenance
    • Good at common queries
  (Figure: index entries, each a page range plus a partial histogram, reached through a list of pointers sorted by start page number.)

  11. Experimental Environment
  • Dataset: NYC Taxi Trips, 175 GB
  • Parameter settings (see the sketch below for how they map onto index options)
    • Hippo: histogram buckets (H) = 400, partial histogram density (D) = 20%
    • BRIN: pages per range (P) = 128
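  A hedged sketch of how the Hippo setting would be expressed. The syntax follows the CREATE INDEX example at the end of the deck; applying it to a PostGIS geometry column is my assumption about the Hippo-Spatial prototype, and the bucket count H is not shown because its option name is not given in the slides:

      -- Hippo index with partial histogram density D = 20% (per the syntax on the last slide).
      -- Whether the prototype accepts a geometry column here is an assumption.
      CREATE INDEX nyctaxi_pickup_hippo
          ON nyctaxi
          USING hippo (pickuplocation)
          WITH (density = 20);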

  12. Indexing Overhead
  • Index size
    • Hippo is roughly 100x smaller than GiST
    • BRIN is roughly 100x smaller than Hippo
  • Reason
    • Hippo and BRIN index pages, not tuples
    • A partial histogram is larger than an MBR
  (Chart: index sizes, logarithmic scale.)

  13. Indexing Overhead (cont.)
  • Index initialization time
    • Hippo and BRIN-Spatial are roughly 100x faster than GiST
    • Hippo takes about 60% of BRIN's time
  • Reason
    • A hierarchical structure costs more to build than a flat one
    • GiST writes lots of temporary disk files
    • BRIN's in-memory entry is updated frequently
  (Chart: initialization time, logarithmic scale.)

  14. Indexing Overhead (cont.)
  (Figure: GiST maintains a hierarchy of non-leaf nodes inside the DBMS, while Hippo and BRIN keep flat lists of entries over the heap pages; each Hippo entry holds a start/end page range plus histogram bits, and each BRIN entry holds a start/end page range plus an MBR (Xmin, Ymin, Xmax, Ymax).)

  15. Query Response Time: vary query selectivity factor
  • Hippo ≈ GiST at 0.1% and 1% selectivity
  • BRIN is always the worst

  16. Index Probe Time: vary query selectivity factor
  • Hippo and BRIN have constant index probe time: they scan all index entries for any query
  • GiST index probe time increases with the selectivity factor
  (Chart: probe time, logarithmic scale.)

  17. Inspected Pages: vary query selectivity factor
  • Hippo inspects about 5x fewer disk pages than BRIN
  • BRIN inspects too many pages at all tested settings (32, 128, and 512 pages per range)
  • A higher density makes Hippo inspect more pages

  18. Query Response Time: vary query areas
  • Setting
    • Area: a percentage of the NYC region's area
    • Dense locations (Times Square, JFK, ...) vs. random locations (picked at random within NYC)
  • Hippo works better on dense locations and medium selectivity factors
  • GiST works better on random locations and highly selective queries

  19. Maintenance Time: vary update ratio
  • Insertion
    • Hippo is roughly 100x faster than GiST (flat index structure)
    • BRIN is roughly 50x faster than Hippo (Hippo must update its partial histograms)
  • Deletion
    • Hippo is roughly 100x faster than GiST
    • BRIN is slower than Hippo: BRIN has to be rebuilt
  (Chart: maintenance time, logarithmic scale.)

  20. Throughput: Hybrid Workloads
  • Queries + updates
  • Update-intensive workloads (10%-50%): Hippo's throughput is roughly 100x higher than GiST's
  • Query-intensive workloads (70%-90%): Hippo ≈ GiST

  21. Summary of Results (GiST-Spatial / Hippo-Spatial / BRIN-Spatial)
  • Storage overhead: 84 GB / 2 GB / 10 MB
  • Initialization time: 28 hours / 30 minutes / 45 minutes
  • Favored selectivity query: range query at 0.001% (10^-5) selectivity / range query with selectivity between 0.01% and 1% / none
  • Favored dense area query: none / query area ≥ 10^-4 % of the NYC region / none
  • Index insertion (10^-4 % of the data): 6 minutes / 4 seconds / 1 second
  • Index deletion (10^-4 % of the data): 2 hours / 2 minutes / index must be rebuilt
  • Hybrid workload: query-intensive / balanced and update-intensive / update-intensive

  22. Take-home Lessons
  • Do not use GiST (a spatial tree index) if storage is limited
  • Do not use BRIN or Hippo for Yelp-like applications; use GiST instead
  • Use Hippo for spatial analytics applications over dynamic and dense spatial data, i.e., query selectivity of 0.1%-1% and update-intensive workloads

  23. Questions?
  • Hippo is available as a PostgreSQL 9.6.1 kernel: https://github.com/DataSystemsLab/hippo-postgresql
  • Build a Hippo index:
      CREATE INDEX hippo_idx ON hippo_tbl USING hippo (randomNumber) WITH (density = 20);
