The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses
Subi Arumugam¹, Alin Dobra¹, Christopher M. Jermaine², Niketan Pansare², Luis Perez²
¹ University of Florida   ² Rice University
June 9, 2010
Motivation
• Storage is cheap: a 1TB disk costs $80-100
• Disks have high throughput
  • a $100 1TB disk does 150MB/s sequential reads/writes
  • a $4,000 1TB SSD (OCZ p88) reads at 1.4GB/s
• Processors are fast: 6 GFLOPS/core, 24 GFLOPS for $100
• TPC-H Q1 (at 1TB scale factor)
  • 8 aggregates over 95-97% of lineitem
  • needs to read about 160-700GB: two p88 SSDs scan that in 60-250s
  • needs about 30 FLOPs × 6·10⁹ tuples = 180 GFLOP of compute; at 24 GFLOPS that is roughly 8s
  • Q1 should be I/O bound; we should be able to run 8 such queries in parallel
• Best non-clustered TPC-H performer: 142s for $1.7M
  • 64 cores, 512GB memory, 576 disks
Large Scale Analytics Goals
• Deal with analytical queries on large data (1-10TB)
• Get closer to theoretical CPU performance
  • the gap stands at 100-1000× for most databases
• Sub-$100,000 system with minute response times (1TB)
  • stay I/O bound even with fast disks and multiple queries
• No or little tuning: no indexing, no tunable partitioning

DataPath
• A system designed from the ground up to meet these goals.
Benchmark System
Old System (2008) – $60,000
• 8 processors, 32 cores
• 128GB DDR2 RAM (16 bays)
• 2 Averatec RAID controllers, 4 12-disk enclosures
• 47 VelociRaptor disks, 8 Barracuda disks
• Maximum aggregate throughput: 2.2GB/s
New System (2010) – $20,000
• 4 processors, 48 cores
• 128GB DDR3 memory
• 2 OCZ Z-Drive 1TB PCIe SSDs
• Maximum aggregate throughput: 2.8GB/s
Data-centric Computation
DataPath Execution Model
• Tuple-oriented execution model
• Tuples shared by queries in the system
• Chunks of tuples pushed into waypoints for processing
• Waypoints implement operations for multiple queries
• Tuple processing loops run at full CPU speed:

  for (int i = 0; i < numTuples; i++) {
      if (tuple[i].BelongsTo(Q1)) Q1.Process(tuple[i]);
      if (tuple[i].BelongsTo(Q2)) Q2.Process(tuple[i]);
      if (tuple[i].BelongsTo(Q3)) Q3.Process(tuple[i]);
  }
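A minimal C++ sketch of this idea, assuming hypothetical names (Tuple, Chunk, queryBits) rather than DataPath's actual interfaces: each tuple carries a bitmap of the queries it still belongs to, and a single pass over a shared chunk serves every interested query.

  #include <cstdint>
  #include <vector>

  struct Tuple {
      double   l_quantity;
      double   l_extendedprice;
      uint32_t queryBits;          // bit i set => tuple still alive for query i
  };

  struct Chunk {
      std::vector<Tuple> tuples;   // a chunk holds a large batch of tuples
  };

  // One pass over a shared chunk updates the aggregates of two queries at once,
  // e.g. Q1: SUM(l_quantity) and Q2: SUM(l_extendedprice).
  void ProcessChunk(const Chunk& c, double& sumQ1, double& sumQ2) {
      for (const Tuple& t : c.tuples) {
          if (t.queryBits & 0x1) sumQ1 += t.l_quantity;
          if (t.queryBits & 0x2) sumQ2 += t.l_extendedprice;
      }
  }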
Query Execution – Example
Q1: SELECT SUM(l_quantity) FROM lineitem WHERE l_shipdate > '1-1-06';
[Plan: lineitem → σ (Q1: l_shipdate > '1-1-06') → Σ (Q1: SUM(l_quantity)) → out]

Query Execution – Example (adding Q2)
Q2: SELECT SUM(l_extendedprice) FROM lineitem, orders WHERE l_shipmode <> 'rail' AND o_orderdate < '1-1-08' AND l_orderkey = o_orderkey;
[Plan: Q2 shares the lineitem scan with Q1; the selection waypoint now evaluates both Q1: l_shipdate > '1-1-06' and Q2: l_shipmode <> 'rail'; Q2 adds σ (Q2: o_orderdate < '1-1-08') over orders, a join waypoint ⋈ (Q2: l_orderkey = o_orderkey), and Σ (Q2: SUM(l_extendedprice)) → out]

Query Execution – Example (adding Q3)
Q3: SELECT AVG(l_discount) FROM lineitem, orders WHERE o_custkey = 1234 AND l_orderkey = o_orderkey;
[Plan: the shared plan over lineitem and orders shown above, with Q3 arriving as a new query]
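To illustrate how one selection waypoint can serve both Q1 and Q2 over the shared lineitem tuples, here is a minimal sketch; the names (Tuple, queryBits, SelectionWaypoint) and the integer date encoding are assumptions, not DataPath's actual code.

  #include <cstdint>
  #include <string>
  #include <vector>

  struct Tuple {
      int         l_shipdate;     // hypothetical encoding: days since some epoch
      std::string l_shipmode;
      uint32_t    queryBits;      // bit 0 = Q1, bit 1 = Q2
  };

  // One shared selection waypoint: a single scan of the chunk evaluates
  // each query's predicate and clears the corresponding membership bit.
  void SelectionWaypoint(std::vector<Tuple>& chunk, int cutoff_1_1_06) {
      for (Tuple& t : chunk) {
          if ((t.queryBits & 0x1) && t.l_shipdate <= cutoff_1_1_06)
              t.queryBits &= ~0x1u;             // fails Q1: l_shipdate > '1-1-06'
          if ((t.queryBits & 0x2) && t.l_shipmode == "rail")
              t.queryBits &= ~0x2u;             // fails Q2: l_shipmode <> 'rail'
      }
  }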
Tuple Processing Loop
Usual problems:
• branch misprediction
• instruction cache misses
• per-tuple overhead
DataPath solution – use a C++ meta-compiler:
• generate new tuple processing loops for each waypoint when new queries are added
• the generated code is human-readable (it even has comments)
• compiled as a library with -O3 -msse4.1
• everything is hardcoded
• the meta-compiler exploits sharing, avoids branch mispredictions, and uses SSE
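As a rough illustration (not DataPath's actual generated code), a hardcoded waypoint loop for the Q1/Q2 example might look like the following: columns arrive as flat arrays, predicate constants and types are baked in, and query membership is updated branch-free. All names and the integer column encodings are assumptions.

  #include <cstddef>
  #include <cstdint>

  // Hypothetical output of the meta-compiler for one waypoint serving Q1 and Q2.
  void WaypointLoop(const int* l_shipdate, const int* l_shipmode,
                    const double* l_quantity, uint32_t* queryBits,
                    size_t n, int cutoffDate, int railCode, double* sumQ1) {
      double acc = 0.0;
      for (size_t i = 0; i < n; ++i) {
          uint32_t q = queryBits[i];
          // Branch-free predicate evaluation: clear a query's bit when its
          // predicate fails, instead of taking a data-dependent branch.
          q &= ~static_cast<uint32_t>(l_shipdate[i] <= cutoffDate);        // bit 0: Q1
          q &= ~(static_cast<uint32_t>(l_shipmode[i] == railCode) << 1);   // bit 1: Q2
          queryBits[i] = q;
          // Q1's running aggregate, again without a branch.
          acc += (q & 0x1u) * l_quantity[i];
      }
      *sumQ1 = acc;
  }

Because the constants, offsets, and types are compiled in, an optimizing compiler can keep this loop free of function calls and amenable to SSE vectorization.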
File Scanner
[Animation: the staging area holds four in-flight chunks (Chunk 1-4); each chunk's blocks (numbered 1-20 in the example) are striped across Disks 1-5 and arrive in whatever order the disks return them. A chunk is marked Finished and handed off as soon as all of its blocks have been read (Chunk 1 completes first and its slot is reused for Chunk 5), so chunks complete out of order while all disks stay busy.]
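A minimal sketch of that bookkeeping, under the assumption (not taken from the DataPath code) that the staging area simply tracks how many blocks of each chunk are still outstanding:

  #include <map>
  #include <vector>

  struct StagingArea {
      std::map<int, int> blocksOutstanding;   // chunkId -> blocks still on disk
      std::vector<int>   finished;            // chunkIds ready for the engine

      void RegisterChunk(int chunkId, int numBlocks) {
          blocksOutstanding[chunkId] = numBlocks;
      }

      // Called from a disk reader when one block of a chunk has arrived.
      void BlockArrived(int chunkId) {
          if (--blocksOutstanding[chunkId] == 0) {
              blocksOutstanding.erase(chunkId);
              finished.push_back(chunkId);     // e.g. Chunk 1 completes first
          }
      }
  };

A real scanner would also cap the number of in-flight chunks (four slots in the animation) and recycle a finished slot for the next chunk to be read.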