

  1. TritonSort: A Balanced Large-Scale Sorting System
  Alex Rasmussen, George Porter, Michael Conley, Radhika Niranjan Mysore, Amin Vahdat (UCSD)
  Harsha V. Madhyastha (UC Riverside)
  Alexander Pucher (Vienna University of Technology)

  2. The Rise of Big Data Workloads
  • Very high I/O and storage requirements
    – Large-scale web and social graph mining
    – Business analytics – “you may also like …”
    – Large-scale “data science”
  • Recent new approaches to the “data deluge”: data-intensive scalable computing (DISC) systems
    – MapReduce, Hadoop, Dryad, …

  3. Performance via scalability
  • 10,000+ node MapReduce clusters deployed
    – With impressive performance
  • Example: Yahoo! Hadoop Cluster Sort
    – 3,452 nodes sorting 100 TB in less than 3 hours
  • But…
    – Less than 3 MB/sec per node
    – Single disk: ~100 MB/sec
  • Not an isolated case
    – See “Efficiency Matters!”, SIGOPS 2010

  4. Overcoming Inefficiency With Brute Force
  • Just add more machines!
    – But expensive, power-hungry mega-datacenters!
  • What if we could go from 3 MBps per node to 30?
    – 10x fewer machines accomplishing the same task
    – or 10x higher throughput

  5. TritonSort Goals
  • Build a highly efficient DISC system that improves per-node efficiency by an order of magnitude vs. existing systems
    – Through balanced hardware and software
  • Secondary goals:
    – Completely “off-the-shelf” components
    – Focus on I/O-driven workloads (“Big Data”)
    – Problems that don’t come close to fitting in RAM
    – Initially sorting, but have since generalized

  6. Outline
  • Define hardware and software balance
  • TritonSort design
    – Highlighting tradeoffs to achieve balance
  • Evaluation with sorting as a case study

  7. Building a “Balanced” System
  • Balanced hardware drives all resources as close to 100% utilization as possible
    – Removing any resource slows us down
    – Limited by commodity configuration choices
  • Balanced software fully exploits hardware resources

  8. Hardware Selection
  • Designed for I/O-heavy workloads
    – Not just sorting
  • Static selection of resources:
    – Network/disk balance
      • 10 Gbps / 80 MBps ≈ 16 disks
    – CPU/disk balance
      • 2 disks / core = 8 cores
    – CPU/memory balance
      • Originally ~1.5 GB/core… later 3 GB/core
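The ratios above are back-of-the-envelope arithmetic. A tiny sketch that just spells out the calculation, using the slide's planning figures (10 Gbps NIC, ~80 MBps of sequential throughput per disk, 2 disks per core, 24 GB of RAM over 8 cores):

```cpp
#include <cstdio>

int main() {
    // Network/disk balance: a 10 Gbps NIC vs. ~80 MB/s per 7200 RPM disk.
    double nic_MBps  = 10e9 / 8 / 1e6;   // 10 Gbps ~= 1250 MB/s
    double disk_MBps = 80.0;
    printf("disks to saturate NIC: %.1f\n", nic_MBps / disk_MBps);  // ~15.6 -> 16 disks

    // CPU/disk balance: 2 disks per core -> 8 cores for 16 disks.
    int disks = 16, disks_per_core = 2;
    printf("cores needed: %d\n", disks / disks_per_core);           // 8

    // CPU/memory balance: 24 GB over 8 cores = 3 GB per core.
    printf("GB per core: %.1f\n", 24.0 / 8);                        // 3.0
    return 0;
}
```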

  9. Resulting Hardware Platform
  52 nodes:
  • Xeon E5520, 8 cores (16 with hyperthreading)
  • 24 GB RAM
  • 16 7200 RPM hard drives
  • 10 Gbps NIC
  • Cisco Nexus 5020 10 Gbps switch

  10. Software Architecture
  • Staged, pipeline-oriented dataflow system
  • Program expressed as a digraph of stages
    – Data stored in buffers that move along edges
    – Stage’s work performed by worker threads
  • Platform for experimentation
    – Easily vary:
      • Stage implementation
      • Size and quantity of buffers
      • Worker threads per stage
      • CPU and memory allocation to each stage
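A minimal sketch of the staged-dataflow idea described above, with made-up names (Buffer, BufferQueue, a toy producer stage): buffers travel along an edge implemented as a thread-safe queue, and the number of worker threads per stage is an explicit tuning knob. This only illustrates the programming model, not TritonSort's actual runtime.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Buffers move along the edges of the stage graph.
struct Buffer { std::vector<char> data; };

// A thread-safe queue: one per edge in the dataflow digraph.
class BufferQueue {
public:
    void push(Buffer b) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(b)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> g(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Returns std::nullopt once the queue is closed and drained.
    std::optional<Buffer> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        Buffer b = std::move(q_.front());
        q_.pop();
        return b;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Buffer> q_;
    bool closed_ = false;
};

int main() {
    BufferQueue edge;          // edge between two stages
    const int kWorkers = 2;    // worker threads per stage are a tuning knob

    // Stage 1: a toy producer that emits a few buffers.
    std::thread producer([&] {
        for (int i = 0; i < 4; ++i) {
            Buffer b;
            std::string s = "buffer " + std::to_string(i);
            b.data.assign(s.begin(), s.end());
            edge.push(std::move(b));
        }
        edge.close();
    });

    // Stage 2: worker threads that consume buffers from the edge.
    std::vector<std::thread> workers;
    for (int w = 0; w < kWorkers; ++w) {
        workers.emplace_back([&, w] {
            while (auto b = edge.pop())
                printf("worker %d got %zu bytes\n", w, b->data.size());
        });
    }

    producer.join();
    for (auto& t : workers) t.join();
    return 0;
}
```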

  11. Why Sorting?
  • Easy to describe
  • Industrially applicable
  • Uses all cluster resources

  12. Current TritonSort Architecture
  • External sort – two reads, two writes*
    – Don’t read and write to disk at same time
      • Partition disks into input and output
  • Two phases
    – Phase one: route tuples to the appropriate on-disk partition (called a “logical disk”) on the appropriate node
    – Phase two: sort all logical disks in parallel
  * A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. CACM, 1988.
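A greatly simplified, single-node sketch of this two-phase structure (no network, a handful of partitions, illustrative file names): phase one scans the input once and appends each 100-byte tuple to a per-partition "logical disk" file; phase two reads each partition into memory, sorts it by its 10-byte key (the record format of the sort benchmark introduced later in the deck), and writes it back.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// 100-byte tuples: 10-byte key, 90-byte value (as in the sort benchmark).
struct Tuple { std::array<char, 10> key; std::array<char, 90> value; };

const int kPartitions = 4;  // a real run uses hundreds of logical disks per node

int main() {
    // Phase one: scan the input once and route each tuple to a partition
    // ("logical disk") file chosen from its key.
    std::ifstream in("input.dat", std::ios::binary);
    std::vector<std::ofstream> parts;
    for (int p = 0; p < kPartitions; ++p)
        parts.emplace_back("logical_disk_" + std::to_string(p), std::ios::binary);

    Tuple t;
    while (in.read(reinterpret_cast<char*>(&t), sizeof t)) {
        int p = static_cast<unsigned char>(t.key[0]) % kPartitions;  // toy partitioner
        parts[p].write(reinterpret_cast<const char*>(&t), sizeof t);
    }
    for (auto& f : parts) f.close();

    // Phase two: each partition now fits in memory; sort it and write it out.
    for (int p = 0; p < kPartitions; ++p) {
        std::ifstream pf("logical_disk_" + std::to_string(p), std::ios::binary);
        std::vector<Tuple> tuples;
        while (pf.read(reinterpret_cast<char*>(&t), sizeof t)) tuples.push_back(t);
        std::sort(tuples.begin(), tuples.end(), [](const Tuple& a, const Tuple& b) {
            return std::memcmp(a.key.data(), b.key.data(), 10) < 0;
        });
        std::ofstream out("sorted_" + std::to_string(p), std::ios::binary);
        out.write(reinterpret_cast<const char*>(tuples.data()),
                  tuples.size() * sizeof(Tuple));
        printf("partition %d: %zu tuples sorted\n", p, tuples.size());
    }
    return 0;
}
```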

  13. Architecture: Phase One (sending side)
  [Pipeline diagram: Input Disks → Reader → NodeDistributor → Sender]

  14. Architecture: Phase One (receiving side)
  [Pipeline diagram: Receiver → LogicalDiskDistributor → Coalescer → Writer → Output Disks 1–8, with a linked list of buffers per partition]

  15. Reader
  [Pipeline diagram with the Reader stage highlighted]
  • 100 MBps/disk * 8 disks = 800 MBps
  • No computation, entirely I/O and memory operations
    – Expect most time spent in iowait
    – 8 reader workers, one per input disk → all reader workers co-scheduled on a single core
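A sketch of one way the reader side could look: one worker per input disk reading large sequential chunks, with all workers pinned to the same core since they mostly sit in iowait. The file path, chunk size, and the use of pthread_setaffinity_np for co-scheduling are illustrative assumptions, not necessarily how TritonSort implements it (Linux-only; build with g++ -pthread).

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// One reader worker per input disk; all of them pinned to the same core,
// since they spend nearly all of their time blocked in disk I/O.
void readerWorker(int diskId, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // Linux-specific

    char path[64];
    snprintf(path, sizeof path, "/mnt/disk%d/input.dat", diskId);  // illustrative path
    FILE* f = fopen(path, "rb");
    if (!f) return;

    std::vector<char> buf(8 << 20);  // read in large sequential chunks
    size_t n, total = 0;
    while ((n = fread(buf.data(), 1, buf.size(), f)) > 0)
        total += n;                  // a real reader would hand buf downstream
    fclose(f);
    printf("disk %d: read %zu bytes\n", diskId, total);
}

int main() {
    std::vector<std::thread> readers;
    for (int d = 0; d < 8; ++d)
        readers.emplace_back(readerWorker, d, /*core=*/0);  // all share core 0
    for (auto& t : readers) t.join();
    return 0;
}
```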

  16. NodeDistributor
  [Pipeline diagram with the NodeDistributor stage highlighted]
  • Appends tuples onto a buffer per destination node
  • Memory scan + hash per tuple
  • 300 MBps per worker
    – Need three workers to keep up with readers
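A sketch of the per-tuple work described above: scan the buffer, hash each tuple's key to pick a destination node, and append the tuple to that node's buffer. The hash function is a placeholder (any cheap hash would do); the 52-node count and 100-byte tuple size come from elsewhere in the deck.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

const int kNodes = 52;
const size_t kTupleSize = 100;   // 10-byte key + 90-byte value

// Placeholder hash over the key (FNV-1a style).
uint64_t hashKey(const char* key, size_t len) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < len; ++i) { h ^= (unsigned char)key[i]; h *= 1099511628211ull; }
    return h;
}

// One send buffer per destination node; the NodeDistributor appends each
// tuple to the buffer of the node responsible for its key.
void distribute(const std::vector<char>& input,
                std::vector<std::vector<char>>& nodeBuffers) {
    for (size_t off = 0; off + kTupleSize <= input.size(); off += kTupleSize) {
        const char* tuple = input.data() + off;
        int node = hashKey(tuple, 10) % kNodes;           // scan + hash per tuple
        auto& buf = nodeBuffers[node];
        buf.insert(buf.end(), tuple, tuple + kTupleSize); // memory copy, no I/O
    }
}

int main() {
    std::vector<char> input(100 * kTupleSize, 'x');       // 100 dummy tuples
    std::vector<std::vector<char>> nodeBuffers(kNodes);
    distribute(input, nodeBuffers);
    size_t filled = 0;
    for (auto& b : nodeBuffers) filled += !b.empty();
    printf("%zu of %d node buffers received tuples\n", filled, kNodes);
    return 0;
}
```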

  17. Sender & Receiver
  [Pipeline diagram with the Sender and Receiver stages highlighted]
  • 800 MBps (from Reader) is 6.4 Gbps
    – All-to-all traffic
  • Must keep downstream disks busy
    – Don’t let receive buffer get empty
    – Implies strict socket send time bound
  • Multiplex all senders on one core (single-threaded tight loop)
    – Visit every socket every 20 µs
    – Didn’t need epoll()/select()
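A sketch of the single-threaded sender loop described above: non-blocking sockets, visited round-robin, with only a small bounded chunk sent per visit so no receiver (and hence no downstream disk) is starved. Socket setup and error handling are omitted; the Peer struct and the 16 KB chunk size are illustrative assumptions.

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/socket.h>
#include <vector>

struct Peer {
    int fd = -1;                   // non-blocking TCP socket to one node
    std::vector<char> pending;     // bytes queued for that node
    size_t sent = 0;               // how much of `pending` has gone out
};

// Single-threaded tight loop: visit every socket in turn and push a small
// chunk, so every downstream receiver (and thus every disk) keeps busy.
void senderLoop(std::vector<Peer>& peers) {
    const size_t kChunk = 16 * 1024;   // small bound on time spent per socket
    bool anyPending = true;
    while (anyPending) {
        anyPending = false;
        for (auto& p : peers) {
            size_t left = p.pending.size() - p.sent;
            if (left == 0) continue;           // nothing queued for this node
            anyPending = true;
            ssize_t n = send(p.fd, p.pending.data() + p.sent,
                             left < kChunk ? left : kChunk, MSG_DONTWAIT);
            if (n > 0)
                p.sent += (size_t)n;
            else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
                return;                        // real code would handle the error
        }
        // No epoll()/select(): with ~50 sockets, simply polling each one in a
        // tight loop is cheap enough to revisit every socket within ~20 us.
    }
}

int main() {
    std::vector<Peer> peers;   // in the real system: one connected socket per node
    senderLoop(peers);         // returns immediately here since nothing is queued
    return 0;
}
```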

  18. Balancing at Scale

  19. Logical Disk Distributor
  [Diagram: incoming tuples t0, t1, t2 are hashed (H(t0) = 1, H(t1) = N) into one of N per-logical-disk buffers of 12.8 KB each]

  20. Logical Disk Distributor
  [Pipeline diagram with the LogicalDiskDistributor stage highlighted]
  • Data non-uniform and bursty at short timescales
    – Big buffers + burstiness = head-of-line blocking
    – Need to use all your memory all the time
  • Solution: read incoming data into the smallest buffer possible, and form chains
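A sketch of the chaining idea from this slide and the previous one: each tuple is hashed to a logical disk and appended to a small (12.8 KB) LDBuffer for that partition; when a buffer fills, it is linked onto the partition's chain instead of triggering a write. The names (LDBuffer, LogicalDisk), the toy hash, and the partition count are illustrative.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <list>
#include <vector>

const size_t kTupleSize    = 100;
const size_t kLDBufferSize = 12800;   // 12.8 KB, as on the previous slide
const int    kLogicalDisks = 315;     // illustrative partition count for this node

// A small buffer of tuples destined for one logical disk.
struct LDBuffer { std::vector<char> bytes; };

struct LogicalDisk {
    LDBuffer current;                 // partially filled buffer
    std::list<LDBuffer> chain;        // full buffers, linked per partition
};

// Route one tuple: hash its key to a logical disk and append it to that
// disk's small current buffer; full buffers get chained, not written yet.
void route(const char* tuple, std::vector<LogicalDisk>& disks) {
    uint64_t h = 0;
    for (int i = 0; i < 10; ++i) h = h * 131 + (unsigned char)tuple[i];  // toy hash
    LogicalDisk& ld = disks[h % kLogicalDisks];
    ld.current.bytes.insert(ld.current.bytes.end(), tuple, tuple + kTupleSize);
    if (ld.current.bytes.size() + kTupleSize > kLDBufferSize) {
        ld.chain.push_back(std::move(ld.current));   // keep memory in small pieces
        ld.current = LDBuffer{};
    }
}

int main() {
    std::vector<LogicalDisk> disks(kLogicalDisks);
    std::vector<char> tuple(kTupleSize, 'a');
    for (int i = 0; i < 200000; ++i) {
        std::memcpy(tuple.data(), &i, sizeof i);     // vary the key a little
        route(tuple.data(), disks);
    }
    size_t chained = 0;
    for (auto& d : disks) chained += d.chain.size();
    printf("%zu full LDBuffers chained across %d logical disks\n",
           chained, kLogicalDisks);
    return 0;
}
```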

  21. Coalescer & Writer
  [Pipeline diagram with the Coalescer and Writer stages highlighted]
  • Copies tuples from LDBuffer chains into a single, sequential block of memory
  • Longer chains = larger write before seeking = faster writes
    – Also, more memory needed for LDBuffers
  • Buffer size limits maximum chain length
    – How big should this buffer be?
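A sketch of the coalescing step: copy as many chained LDBuffers as fit into one contiguous writer buffer, so the Writer can issue a single large sequential write. The 16 MB cap stands in for the "how big should this buffer be?" tuning question; the names are illustrative.

```cpp
#include <cstdio>
#include <list>
#include <vector>

struct LDBuffer { std::vector<char> bytes; };

const size_t kWriterBufferCap = 16 << 20;   // 16 MB cap: the tuning knob the slide asks about

// Coalesce as much of the chain as fits into one contiguous buffer.
// Longer chains -> larger sequential writes -> fewer seeks.
std::vector<char> coalesce(std::list<LDBuffer>& chain) {
    std::vector<char> writerBuffer;
    writerBuffer.reserve(kWriterBufferCap);
    while (!chain.empty() &&
           writerBuffer.size() + chain.front().bytes.size() <= kWriterBufferCap) {
        auto& b = chain.front().bytes;
        writerBuffer.insert(writerBuffer.end(), b.begin(), b.end());
        chain.pop_front();           // the small buffer can be reused upstream
    }
    return writerBuffer;             // handed to the Writer for one large write
}

int main() {
    std::list<LDBuffer> chain;
    for (int i = 0; i < 100; ++i)                 // 100 x 12.8 KB = 1.28 MB chain
        chain.push_back(LDBuffer{std::vector<char>(12800, 'x')});
    std::vector<char> out = coalesce(chain);
    printf("coalesced %zu bytes into one write; %zu buffers left\n",
           out.size(), chain.size());
    return 0;
}
```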

  22. Writer
  [Pipeline diagram with the Writer stage highlighted]

  23. Architecture: Phase Two
  [Pipeline diagram: Input Disks → Reader → Sorter → Writer → Output Disks]

  24. Sort Benchmark Challenge
  • Started in the 1980s by Jim Gray, now run by a committee of volunteers
  • Annual competition with many categories
    – GraySort: sort 100 TB
  • “Indy” variant
    – 10-byte key, 90-byte value
    – Uniform key distribution
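The official benchmark input is produced by the gensort tool; the sketch below is only a stand-in that shows the "Indy" record layout, i.e. 100-byte records with a uniformly random 10-byte key and a 90-byte value whose contents don't matter for sorting. The file name and record count are illustrative.

```cpp
#include <cstdio>
#include <fstream>
#include <random>

// 100-byte "Indy"-style records: 10-byte uniformly distributed key, 90-byte value.
int main() {
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<int> byte(0, 255);

    std::ofstream out("records.dat", std::ios::binary);
    const long kRecords = 1000;                 // tiny sample; real runs are terabytes
    char rec[100];
    for (long i = 0; i < kRecords; ++i) {
        for (int k = 0; k < 10; ++k) rec[k] = (char)byte(rng);   // uniform key
        for (int v = 10; v < 100; ++v) rec[v] = 'v';             // payload is arbitrary
        out.write(rec, sizeof rec);
    }
    printf("wrote %ld records (%ld bytes)\n", kRecords, kRecords * 100);
    return 0;
}
```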

  25. How balanced are we?
  Phase one:
  Worker Type        Workers   Total Throughput (MBps)   % Over Bottleneck Stage
  Reader             8         683                       13%
  Node-Distributor   3         932                       55%
  LD-Distributor     1         683                       13%
  Coalescer          8         18,593                    30,000%
  Writer             8         601                       0%
  Phase two:
  Reader             8         740                       3.2%
  Sorter             4         1089                      52%
  Writer             8         717                       0%

  26. How balanced are we? Resource utilization:
  Phase       CPU    Memory   Network   Disk
  Phase One   25%    100%     50%       82%
  Phase Two   50%    100%     0%        100%

  27. Scalability

  28. Raw 100 TB “Indy” Performance
  [Bar chart: performance per node (TB per minute)]
  • TritonSort: 0.938 TB per minute with 52 nodes
  • Previous record holder: 0.564 TB per minute with 195 nodes
  • ~6x the per-node performance

  29. Impact of Faster Disks
  • 7.2K RPM → 15K RPM drives
  • Smaller capacity means fewer LDs
  • Examined effect of disk speed and # LDs
  • Removing a bottleneck moves the bottleneck somewhere else
  Disk Speed (RPM)   Intermediate Logical Disks Per Physical Disk   Phase One Throughput (MBps)   Phase One Bottleneck Stage   Average Write Size (MB)
  7200               315                                            69.81                         Writer                       12.6
  7200               158                                            77.89                         Writer                       14.0
  15000              158                                            79.73                         LD Distributor               5.02

  30. Impact of Increased RAM
  • Hypothesis that memory influences chain length, and thus write speed
  • Doubling memory indeed increases chain length, but the effect on performance was minimal
  • Increasing a non-bottleneck resource made it faster, but not by much
  RAM Per Node (GB)   Phase One Throughput (MBps)   Average Write Size (MB)
  24                  73.53                         12.43
  48                  76.43                         19.21

  31. Future Work
  • Generalization
    – We have a fast MapReduce implementation
    – Considering other applications and programming paradigms
  • Automatic Tuning
    – Determine appropriate buffer size & count, # workers per stage for reasonable performance
      • Different hardware
      • Different workloads

  32. TritonSort – Questions?
  • Proof-of-concept balanced sorting system
  • 6x improvement in per-node efficiency vs. previous record holder
  • Current top speed: 938 GB per minute
  • Future Work: Generalization, Automation
  http://tritonsort.eng.ucsd.edu/
