  1. LESSONS AND PREDICTIONS FROM 25 YEARS OF PARALLEL DATA SYSTEMS DEVELOPMENT
     Parallel Data Storage Workshop, SC11
     Brent Welch, Director, Architecture

  2. OUTLINE
     § Theme
       • Architecture for robust distributed systems
       • Code structure
     § Ideas from Sprite
       • Naming vs I/O
       • Remote Waiting
       • Error Recovery
     § Ideas from Panasas
       • Distributed System Platform
       • Parallel Declustered Object RAID
     § Open Problems, especially at ExaScale
       • Getting the Right Answer
       • Fault Handling
       • Auto Tuning
       • Quality of Service

  3. WHAT CUSTOMERS WANT
     § Ever Scale, Never Fail, Wire Speed systems
       • This is our customers' expectation
     § How do you build that?
       • Infrastructure
       • Fault Model

  4. IDEAS FROM SPRITE
     § Sprite OS
       • UC Berkeley, 1984 into the 1990s, under John Ousterhout
       • Network of diskless workstations and file servers
       • Built from scratch on Sun2, Sun3, Sun4, DS3100, and SPUR hardware
         − 680XX, 8 MHz, 4 MB, 4-micron, 40 MB, 10 Mbit/s ("Mega")
       • Supported a user population of 5 professors and 25-30 grad students
       • Built by 4 to 8 grad students: Brent Welch, Fred Douglis, Mike Nelson, Andy Cherenson, Mary Baker, Ken Shirriff, Mendel Rosenblum, John Hartman
     § Process Migration and a Shared File System
       • FS cache coherency
       • Write-back caching on diskless file system clients
       • Fast parallel make
       • LFS log-structured file system
     § A look under the hood
       • Naming vs I/O
       • Remote Waiting
       • Host Error Monitor

  5. VFS: NAMING VS I/O
     § Naming API
       • Create, Open, GetAttr, SetAttr, Delete, Rename, Hardlink
     § I/O API
       • Open, Read, Write, Close, Ioctl
     § 3 implementations of each API
       • Local kernel
       • Remote kernel
       • User-level process
     § Compose different naming and I/O cases
     [Diagram: the POSIX system call API splits into a Name API and an I/O API; each dispatches to the local kernel, to a remote kernel via the RPC service, or to user-space daemons (devices, FS).]
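
A minimal sketch of the naming vs I/O split, assuming illustrative structure names (the real Sprite kernel dispatches through tables such as fsio_StreamOpTable, shown on slide 10): separate op tables for the Name API and the I/O API let one system call layer compose the local-kernel, remote-kernel, and user-level cases per object.

    /* Hedged sketch: per-type op tables for the Name and I/O APIs.
     * Names and types here are illustrative, not the Sprite sources. */
    typedef int Status;

    typedef struct NameOps {
        Status (*create)(const char *path);
        Status (*getAttr)(const char *path, void *attrPtr);
        Status (*rename)(const char *from, const char *to);
    } NameOps;

    typedef struct IoOps {
        Status (*read)(void *streamPtr, void *buf, int offset, int *lenPtr);
        Status (*write)(void *streamPtr, const void *buf, int offset, int *lenPtr);
        Status (*ioctl)(void *streamPtr, int cmd, void *arg);
    } IoOps;

    /* One implementation per case; a stream records which case it is. */
    enum { STREAM_LOCAL, STREAM_REMOTE, STREAM_USER, STREAM_TYPE_COUNT };
    extern NameOps nameOpTable[STREAM_TYPE_COUNT];
    extern IoOps   ioOpTable[STREAM_TYPE_COUNT];

    /* The system call layer is written once and dispatches by type, so
     * naming can resolve to one implementation and I/O to another. */
    Status SketchRead(int streamType, void *streamPtr,
                      void *buf, int offset, int *lenPtr)
    {
        return ioOpTable[streamType].read(streamPtr, buf, offset, lenPtr);
    }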

  6. NAMING VS I/O SCENARIOS
     § Directory tree is on file servers
     § Devices are local or on a specific host
     § Namespace divided by prefix tables
     § User-space daemons do either/both APIs
     [Diagram: a diskless node with local devices (/dev/console, /dev/keyboard); file servers holding names for devices and files and storage for files; a special node with shared devices (/host/allspice/dev/tape); and a user-space daemon (/tcp/ipaddr/port).]
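
A hedged sketch of prefix-table name resolution, assuming a static table (real Sprite prefix tables are populated and re-validated dynamically): the longest matching prefix selects the domain, local, remote, or user-level, that resolves the rest of the path.

    #include <string.h>

    /* Illustrative prefix table: longest-prefix match picks the server
     * or daemon that owns a subtree of the namespace. */
    typedef struct PrefixEntry {
        const char *prefix;     /* e.g. "/", "/host/allspice/dev", "/tcp" */
        int         domainId;   /* hypothetical id of the serving domain */
    } PrefixEntry;

    static const PrefixEntry prefixTable[] = {
        { "/",                  1 },   /* default root file server */
        { "/host/allspice/dev", 2 },   /* devices on a specific host */
        { "/tcp",               3 },   /* user-space daemon namespace */
    };

    int PrefixLookup(const char *path, const char **relPathPtr)
    {
        int best = -1;
        size_t bestLen = 0;
        size_t i;
        for (i = 0; i < sizeof(prefixTable) / sizeof(prefixTable[0]); i++) {
            size_t len = strlen(prefixTable[i].prefix);
            if (strncmp(path, prefixTable[i].prefix, len) == 0 && len >= bestLen) {
                best = (int)i;
                bestLen = len;
            }
        }
        if (best < 0) {
            return -1;                       /* no matching domain */
        }
        *relPathPtr = path + bestLen;        /* remainder resolved by that domain */
        return prefixTable[best].domainId;
    }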

  7. SPRITE FAULT MODEL
     [Diagram: a kernel operation returns OK or ERROR, may return WOULD_BLOCK and later be released by an UNBLOCK message, or may hit an RPC timeout that triggers RECOVERY.]
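
The fault model collapses every kernel operation into a small set of outcomes. The code names used on the later slides (FS_WOULD_BLOCK, RPC_TIMEOUT, FS_STALE_HANDLE, RPC_SERVICE_DISABLED) suggest a sketch like the following; the enum itself is illustrative, not the actual Sprite header:

    /* Illustrative outcome codes for a Sprite kernel operation. */
    typedef enum {
        OK = 0,                /* operation completed */
        GENERIC_ERROR,         /* real error, returned to the caller */
        FS_WOULD_BLOCK,        /* would block; wait for an UNBLOCK message */
        RPC_TIMEOUT,           /* remote peer unresponsive; trigger RECOVERY */
        FS_STALE_HANDLE,       /* server lost its state; trigger RECOVERY */
        RPC_SERVICE_DISABLED   /* server shutting down; trigger RECOVERY */
    } Status;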

  8. REMOTE WAITING
     § Classic race
       • The WOULD_BLOCK reply races with the UNBLOCK message
       • If the UNBLOCK arrives first it is ignored, and the request waits forever
     § Fix: 2 bits and a generation ID
       • Process table has MAY_BLOCK and DONT_WAIT flag bits
       • The wait generation ID is incremented when MAY_BLOCK is set
       • The DONT_WAIT flag is set when the race is detected, based on the generation ID
     [Diagram: the op request sets MAY_BLOCK and increments the generation; an early UNBLOCK sets DONT_WAIT so the waiter does not sleep.]
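
A minimal sketch of the 2-bit-plus-generation fix, with illustrative names patterned after the Sync_GetWaitToken / Sync_ProcWait calls on slide 10 (locking is omitted):

    /* The requester marks itself MAY_BLOCK with a fresh generation before
     * issuing the op. An UNBLOCK that arrives before the WOULD_BLOCK reply
     * is processed sets DONT_WAIT, so the later wait returns immediately
     * instead of sleeping forever. */
    typedef struct Proc {
        int          mayBlock;  /* MAY_BLOCK: op in flight that can block */
        int          dontWait;  /* DONT_WAIT: unblock already arrived */
        unsigned int waitGen;   /* incremented each time MAY_BLOCK is set */
    } Proc;

    void GetWaitToken(Proc *p, unsigned int *genPtr)
    {
        p->mayBlock = 1;
        p->dontWait = 0;
        p->waitGen++;                /* start a new wait epoch */
        *genPtr = p->waitGen;
    }

    void HandleUnblock(Proc *p, unsigned int gen)
    {
        /* Race detection: an UNBLOCK for the current epoch arrived while
         * the process was still marked MAY_BLOCK. */
        if (p->mayBlock && gen == p->waitGen) {
            p->dontWait = 1;
        }
        /* ...and wake the process if it is already sleeping... */
    }

    int ProcWait(Proc *p)
    {
        p->mayBlock = 0;
        if (p->dontWait) {
            return 0;                /* unblock won the race; don't sleep */
        }
        /* otherwise sleep until HandleUnblock wakes this process */
        return 0;
    }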

  9. HOST ERROR MONITOR
     § API: Want Recovery, Wait for Recovery, Recovery Notify
       • Subsystems register for errors
       • High-level (syscall) layer waits for error recovery
     § Host Monitor
       • Pings remote peers that need recovery
       • Triggers the Notify callback when the peer is ready
       • Makes all processes runnable after notify callbacks complete
     [Diagram: subsystems tell the host monitor they want recovery; the monitor pings the remote kernel and delivers notify callbacks when it responds.]
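
A hedged sketch of the Want Recovery / Recovery Notify side of the API (the real routines, per the next two slides, are Fsutil_WantRecovery and Fsutil_WaitForRecovery; the registration table and Ping stub here are illustrative):

    /* Subsystems that hit an RPC error register the affected handle and a
     * notify callback; the host monitor pings the host, fires the callbacks
     * once it answers, and then makes waiting processes runnable. */
    #define MAX_RECOVERY 64

    typedef struct Handle { int hostId; /* ...per-object state... */ } Handle;
    typedef void (*NotifyProc)(Handle *handle);

    static struct { Handle *handle; NotifyProc notify; } recovTable[MAX_RECOVERY];
    static int recovCount;

    int WantRecovery(Handle *handle, NotifyProc notify)
    {
        if (recovCount >= MAX_RECOVERY) return -1;
        recovTable[recovCount].handle = handle;
        recovTable[recovCount].notify = notify;
        recovCount++;
        return 0;
    }

    /* Stub: in the real system this is an "are you alive" RPC. */
    static int Ping(int hostId) { (void)hostId; return 1; }

    /* One sweep of the host monitor. In the kernel this would also wake
     * every process blocked in WaitForRecovery on a recovered host. */
    void HostMonitorSweep(void)
    {
        int i;
        for (i = 0; i < recovCount; i++) {
            if (Ping(recovTable[i].handle->hostId)) {
                recovTable[i].notify(recovTable[i].handle);
            }
        }
        recovCount = 0;   /* simplified: assume all peers recovered */
    }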

  10. SPRITE SYSTEM CALL STRUCTURE
      § System call layer handles blocking conditions, above the VFS API

      Fs_Read(streamPtr, buffer, offset, lenPtr)
      {
          /* set up parameters in ioPtr */
          while (TRUE) {
              Sync_GetWaitToken(&waiter);
              rc = (fsio_StreamOpTable[streamType].read)
                       (streamPtr, ioPtr, &waiter, &reply);
              if (rc == FS_WOULD_BLOCK) {
                  rc = Sync_ProcWait(&waiter);
              }
              if (rc == RPC_TIMEOUT || rc == FS_STALE_HANDLE ||
                  rc == RPC_SERVICE_DISABLED) {
                  rc = Fsutil_WaitForRecovery(streamPtr->ioHandlePtr, rc);
              }
              /* break or continue as appropriate */
          }
      }

  11. SPRITE REMOTE ACCESS
      § Remote kernel access uses RPC and must handle errors

      Fsrmt_Read(streamPtr, ioPtr, waitPtr, replyPtr)
      {
          /* loop over chunks of the buffer */ {
              rc = Rpc_Call(handle, RPC_FS_READ, parameter_block);
              if (rc == OK || rc == FS_WOULD_BLOCK) {
                  /* update chunk pointers;
                   * continue, or break on a short read or FS_WOULD_BLOCK */
              } else if (rc == RPC_TIMEOUT) {
                  rc = Fsutil_WantRecovery(handle);
                  break;
              }
              if (done) break;
          }
          return rc;
      }

  12. SPRITE ERROR RETRY LOGIC
      § System Call Layer
        • Sets up to prevent races
        • Tries an operation
        • Waits for blocking I/O or error recovery w/out locks held
      § Subsystem
        • Takes locks
        • Detects errors and registers the problem
        • Reacts to the recovery trigger
        • Notifies waiters
      [Diagram: the same layering as slide 5, with Sync_ProcWait and Fsutil_WaitForRecovery in the system call layer, and Sync_ProcWakeup and Fsutil_WantRecovery in the naming/I/O implementations below.]

  13. SPRITE
      § Tightly coupled collection of OS instances
        • Global process ID space (host + pid)
        • Remote wakeup
        • Process migration
        • Host monitor and state recovery protocols
      § Thin "Remote" layer, optimized by write-back file caching
        • General composition of the remote case with kernel and user services
        • Simple, unified error handling

  14. IDEAS FROM PANASAS
      § Panasas Parallel File System
        • Founded by Garth Gibson
        • 1999-2011+
        • Commercial
        • Object RAID
        • Blade hardware
        • Linux RPM to mount /panfs
      § Features
        • Parallel I/O, NFS, CIFS, snapshots, management GUI, hardware/software fault tolerance, data management APIs
      § Distributed System Platform
        • Lamport's Paxos algorithm
      § Object RAID
        • NASD heritage

  15. PANASAS FAULT MODEL
      [Diagram: file system clients talk to file system services; each service keeps a transaction log and a backup log; services send heartbeats and error reports to a fault-tolerant Realm Manager, which holds the config DB and sends control and configuration back.]

  16. PANASAS DISTRIBUTED SYSTEM PLATFORM
      § Problem: managing large numbers of hardware and software components in a highly available system
        • What is the system configuration?
        • What hardware elements are active in the system?
        • What software services are available?
        • What software services are activated, or backup?
        • What is the desired state of the system?
        • What components are failed?
        • What recovery actions are in progress?
      § Solution: a fault-tolerant Realm Manager to control all other software services and (indirectly) hardware modules
        • Distributed file system is one of several services managed by the RM
          − Configuration management
          − Software upgrade
          − Failure detection
          − GUI/CLI management
          − Hardware monitoring

  17. MANAGING SERVICES
      § Control Strategy
        • Monitor -> Decide -> Control -> Monitor
        • Controls act on one or more distributed system elements that can fail
        • State machines have "sweeper" tasks to drive them periodically
      [Diagram: a Generic Manager sends heartbeat status to the Realm Manager; the Realm Manager's decision state machine(s) issue configuration updates, service actions, and hardware controls back.]
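
A hedged sketch of the monitor, decide, control loop driven by a periodic sweeper task (the names and the 5-second period are illustrative, not the Panasas implementation):

    #include <unistd.h>

    /* The sweeper re-evaluates each managed service on every pass, so a
     * control that was lost (the element it acted on can fail) is simply
     * retried on the next sweep. */
    typedef enum { SVC_DOWN, SVC_STARTING, SVC_UP } SvcState;

    typedef struct Service {
        SvcState desired;    /* what the configuration says it should be */
        SvcState observed;   /* what heartbeats last reported */
    } Service;

    /* Stubs: in a real system these are RPCs to the node hosting the service. */
    static SvcState Monitor(const Service *s) { return s->observed; }
    static void     StartService(Service *s)  { s->observed = SVC_STARTING; }

    void SweeperTask(Service *services, int n)
    {
        for (;;) {
            int i;
            for (i = 0; i < n; i++) {
                SvcState now = Monitor(&services[i]);          /* Monitor */
                if (services[i].desired == SVC_UP && now == SVC_DOWN) {
                    StartService(&services[i]);                /* Decide + Control */
                }
            }
            sleep(5);   /* drive the state machines periodically */
        }
    }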

  18. FAULT TOLERANT REALM MANAGER
      § PTP Voting Protocol
        • 3-way or 5-way redundant Realm Manager (RM) service
        • PTP (Paxos) voting protocol among a majority quorum to update state
      § Database
        • Synchronized state maintained in a database on each Realm Manager
        • State machines record necessary state persistently
      § Recovery
        • Realm Manager instances fail stop w/out a majority quorum
        • Replay DB updates to re-joining members, or to new members
      [Diagram: three RM instances, each with decision state machine(s) and a local DB, linked by PTP.]
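
A hedged sketch of the majority-quorum rule behind the update and fail-stop behavior (this is only the quorum arithmetic, not the PTP/Paxos message protocol itself):

    /* For a 3-way or 5-way Realm Manager set, an update commits only when
     * a majority accepts it, and an instance that cannot reach a majority
     * fail-stops rather than serve possibly stale state. A rejoining
     * instance has the missed DB updates replayed to it. */
    typedef struct RealmSet {
        int total;       /* 3 or 5 RM instances */
        int reachable;   /* instances this node can reach, including itself */
    } RealmSet;

    static int Majority(int total) { return total / 2 + 1; }

    int CanCommitUpdate(const RealmSet *rs, int acks)
    {
        return acks >= Majority(rs->total);    /* e.g. 2 of 3, or 3 of 5 */
    }

    int MustFailStop(const RealmSet *rs)
    {
        return rs->reachable < Majority(rs->total);
    }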

  19. LEVERAGING VOTING PROTOCOLS (PTP)
      § Interesting activities require multiple PTP steps
        • Decide - Control - Monitor
        • Many different state machines with PTP steps for different product features
          − Panasas metadata services: primary and backup instances
          − NFS virtual server failover (pools of IP addresses that migrate)
          − Storage server failover in front of shared storage devices
          − Overall realm control (reboot, upgrade, power down, etc.)
      § Too heavy-weight for file system metadata or I/O
        • Record service and hardware configuration and status
        • Don't use it for open, close, read, write
      [Diagram: director servers 1-3, each running PanFS, MDS, and NFSd instances, in front of shared OSD storage.]

  20. PANASAS DATA INTEGRITY
      § Object RAID
        • Horizontal, declustered striping with redundant data on different OSDs
        • Per-file RAID equation allows multiple layouts
          − Small files are mirrored RAID-1
          − Large files are RAID-5 or RAID-10
          − Very large files use a two-level striping scheme to counter network incast
      § Vertical Parity
        • RAID across sectors to catch silent data corruption
        • Repairs single-sector media defects
      § Network Parity
        • Read back per-file parity to achieve true end-to-end data integrity
      § Background scrubbing
        • Media, RAID equations, distributed file system attributes
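
A minimal sketch of what a per-file RAID equation implies: the layout is chosen per file, typically by size. The thresholds and names below are hypothetical, not Panasas's actual policy:

    /* Each file carries its own RAID equation, so small files can be
     * mirrored while larger files are striped with parity across a
     * declustered set of OSDs. Thresholds are made up for this sketch. */
    typedef enum {
        LAYOUT_RAID1,       /* small files: mirrored */
        LAYOUT_RAID5,       /* large files: striped with parity */
        LAYOUT_RAID10,      /* large files: striped mirrors */
        LAYOUT_TWO_LEVEL    /* very large files: two-level striping vs. incast */
    } Layout;

    Layout ChooseLayout(unsigned long long fileSize)
    {
        const unsigned long long SMALL_MAX = 64ULL << 10;    /* hypothetical 64 KB */
        const unsigned long long HUGE_MIN  = 100ULL << 30;   /* hypothetical 100 GB */

        if (fileSize <= SMALL_MAX) return LAYOUT_RAID1;
        if (fileSize >= HUGE_MIN)  return LAYOUT_TWO_LEVEL;
        return LAYOUT_RAID5;        /* or LAYOUT_RAID10, per policy */
    }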
