Envisioning a Parallel File System without Dedicated Metadata

What’s Beyond IndexFS & BatchFS
Envisioning a Parallel File System without Dedicated Metadata Servers
Qing Zheng, Kai Ren, Garth Gibson, Bradley W. Settlemyer, Gary Grider
Carnegie Mellon University, Los Alamos National Laboratory

Scaling needs decoupling
• NASD [asplos98]
  o decoupling data from metadata
  o Lustre, Google FS, etc
• IndexFS [sc14]
  o dynamically partitioned metadata middleware
  o orders of magnitude faster than Lustre in metadata
[Figure: throughput (Kop/s) of IndexFS_Lustre (32 clients run IndexFS) vs. Lustre (single server, 32 clients) for empty file creation, file lookup, and file deletion; bars annotated 30x, 300x, and 100x]
Exa-scaling demands ever more decoupling
Parallel Data Lab - http://www.pdl.cmu.edu/ PDSW 2015 Page 2

Compute-side server code
• BatchFS [pdsw14]
  o decoupling clients from servers
  o temporarily scale beyond the total number of servers
  o very fast for a while and eventually clients communicate with servers to merge updates
[Figure: file creates (Kop/s) with 16 servers and 64 clients; IndexFS 618, BatchFS 19,692 (30X)]
How much further can we delay & decouple merging?

∆FS Goal
• Want the peak throughput BatchFS demonstrated
• Compel freedom from server synchronization
  o by eliminating all server machines
  o by dealing with issues arising from the absence of metadata servers
  o by not assuming an underlying PFS
Scale beyond BatchFS

Agenda
• DeltaFS design
• Why no dedicated servers is not a problem

Middleware Design
∆FS is middleware spawned by each parallel app
[Figure: an app with processes P1, P2, P3, ..., Pn linked to ∆FS, alongside other apps, all on top of an object store storing data/metadata]

∆FS Overview
FS defined by a set of snapshots stored as sets of metadata logs and data objects
[Figure: an FS snapshot shown as a Logical View (a directory tree) and as Object Storage (a set of logs); each log is a list of metadata ops, e.g. "rename /d -> /e" and "rmdir /c"]
Note: data objects not shown here

System Model
Reads input dataset from an existing FS snapshot
Creates a new snapshot with output data inserted in
[Figure: an App takes an input snapshot (produced by a previous app) and creates a new snapshot (ready to be used by future apps); shown as a Logical View and as logs in Object Storage]
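The system model above can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory `OBJECT_STORE` dictionary, the `Snapshot` class, and the `run_app` and `list_namespace` helpers are hypothetical names invented here, not part of DeltaFS. The sketch shows a snapshot as an ordered set of log names, and an app producing a new snapshot by adding one log without modifying its input.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for the shared object store (assumption).
OBJECT_STORE: Dict[str, Dict[str, str]] = {}


@dataclass(frozen=True)
class Snapshot:
    """A snapshot is modeled as an ordered tuple of log object names."""
    logs: Tuple[str, ...]


def run_app(input_snapshot: Snapshot, log_name: str,
            new_entries: Dict[str, str]) -> Snapshot:
    """Dump this app's metadata updates as one new log and return
    a new snapshot; the input snapshot is left untouched."""
    OBJECT_STORE[log_name] = dict(new_entries)
    return Snapshot(input_snapshot.logs + (log_name,))


def list_namespace(snapshot: Snapshot) -> Dict[str, str]:
    """Build the logical view for illustration: later logs override
    earlier ones. A real system would read the logs in place."""
    view: Dict[str, str] = {}
    for name in snapshot.logs:
        view.update(OBJECT_STORE[name])
    return view


# A previous app produced log-0; this app reads it and adds log-1.
OBJECT_STORE["log-0"] = {"/input/mesh": "obj-1"}
snap_in = Snapshot(("log-0",))
snap_out = run_app(snap_in, "log-1", {"/output/result": "obj-2"})
```

Because `snap_in` is immutable, a future app can still start from it, or from `snap_out`, without any server arbitrating between the two.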

Key take-away
• NO global namespace
  Each namespace is defined by the app and the logs loaded by it
• NO false sharing
  Apps don’t access logs not needed by them
• NO dedicated metadata servers
  App directly communicates with the storage to load/dump metadata logs

How logs are implemented?
• TableFS [atc13]
  o namespace = a large dir entry table + embedded inodes
• implemented as LSM-Tree (a collection of ordered B-Trees)
• Each log object is a differential B-Tree (diff)
  o representing a set of recent updates (e.g. newly inserted/modified inodes)
[Figure: a log drawn as a tree of k/v pairs]

Why LSM-Tree is a good idea?
• Logs are first-class data
  No need to replay logs to recover namespaces
  Near-zero cost of merging namespaces
• Each log is self-indexed
  Scanning/reading within a single log is fast: O(log N)
  Scanning/reading a series of non-overlapping logs is as fast as a single log
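To make the "self-indexed log" point concrete, here is a minimal sketch assuming each log is a sorted key/value table searched by binary search. The `DiffLog` class, the `lookup` function, the use of full paths as keys, and the use of `None` as a deletion marker are all simplifying assumptions made for illustration; they are not the TableFS or DeltaFS on-disk format.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class DiffLog:
    """One log object: a sorted, immutable table of key/value updates.
    A value of None marks a deletion (an assumption of this sketch)."""

    def __init__(self, updates: Dict[str, Optional[str]]):
        self.keys: List[str] = sorted(updates)
        self.values: List[Optional[str]] = [updates[k] for k in self.keys]

    def get(self, key: str) -> Tuple[bool, Optional[str]]:
        """Binary search within one log: O(log N) comparisons."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return True, self.values[i]
        return False, None


def lookup(logs: List[DiffLog], key: str) -> Optional[str]:
    """Search the newest log first; the first log that mentions the
    key decides the answer, so no log is ever replayed or rewritten."""
    for log in reversed(logs):
        found, value = log.get(key)
        if found:
            return value
    return None


older = DiffLog({"/a": "inode-1", "/c": "inode-2"})
newer = DiffLog({"/c": None, "/e": "inode-3"})  # removes /c, adds /e
namespace = [older, newer]
```

Merging two namespaces in this model amounts to concatenating their lists of logs, which is why the cost is near zero when the logs do not overlap.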

Agenda
• DeltaFS design
• Why no dedicated servers is not a problem

P1: Do my apps need the FS to communicate/synchronize?

Unrelated Apps
Work on different datasets and don’t communicate.
[Figure: a namespace tree with / at the root, climate and ocean below it, and pacific and atlantic as leaves; App 1 and App 2 use different subtrees]
Don’t need the FS to communicate

Self-Coordinating Apps
Use middleware to share faster & more efficiently
[Figure: a parallel scientific app whose processes P1, P2, P3 coordinate over MPI and write one File]
Don’t need the FS to communicate

Workflow Apps
Externally coordinated by job schedulers
[Figure: a job scheduler / workflow engine drives Mapper, Reducer, Iter3, and Iter4 stages over the datasets login_log, user_profile, and movie_profile under /]
Don’t need the FS to communicate

Anonymous Synchronization
e.g. Two app instances competing for mastership
[Figure: App 1 and App 2 contending for a .LOCK file on Lustre, versus coordinating through Zookeeper (ZAB), Paxos, or Raft]
Turn to a mechanism outside the FS to coordinate

P2: But I often use different programs to access data concurrently!

User requested concurrent sharing
[Figure: the primary App (processes P1, P2, P3, ..., Pn) runs ∆FS; the Mon and Viz programs each link ∆FS and attach to it]
Link to ∆FS middleware and attach to the primary parallel app

P3: Which snapshots to use?

Which snapshots to use?
Option 1: rely on job schedulers to automate namespace propagation
[Figure: a job scheduler / workflow engine passes input=… and output=… to App_1 and App_2]

Which snapshots to use?
Option 2: ask external registries using search predicates
[Figure: (1) an App publishes to the snapshot registry; (2) App_1 searches the registry; (3) App_2 collects]

Finding snapshots is like searching a page using Google
• Possible search predicates
  o find latest stable science code for my science
  o find latest recommended mesh model and cleaned input data
  o find latest vendor recommended HW libraries
• Also, there can be multiple snapshot registries
Allows programmable namespace composition
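A registry that answers search predicates could look roughly like the sketch below. Everything in it is hypothetical: the slides do not define a registry API, so `SnapshotRecord`, `SnapshotRegistry`, the `version` field, and the tag names are assumptions chosen only to show how "find latest stable science code" might be phrased as a predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SnapshotRecord:
    """Metadata describing one published snapshot (hypothetical schema)."""
    name: str
    version: int
    tags: Dict[str, str] = field(default_factory=dict)


Predicate = Callable[[SnapshotRecord], bool]


class SnapshotRegistry:
    """In-memory registry supporting publish and predicate search."""

    def __init__(self) -> None:
        self._records: List[SnapshotRecord] = []

    def publish(self, record: SnapshotRecord) -> None:
        self._records.append(record)

    def search(self, predicate: Predicate) -> List[SnapshotRecord]:
        return [r for r in self._records if predicate(r)]

    def latest(self, predicate: Predicate) -> Optional[SnapshotRecord]:
        """Return the matching record with the highest version, if any."""
        matches = self.search(predicate)
        return max(matches, key=lambda r: r.version) if matches else None


registry = SnapshotRegistry()
registry.publish(SnapshotRecord("sci-code", 1, {"kind": "code", "status": "stable"}))
registry.publish(SnapshotRecord("sci-code", 2, {"kind": "code", "status": "stable"}))
registry.publish(SnapshotRecord("sci-code", 3, {"kind": "code", "status": "beta"}))
registry.publish(SnapshotRecord("mesh", 7, {"kind": "input", "status": "recommended"}))


def stable_code(r: SnapshotRecord) -> bool:
    return r.tags.get("kind") == "code" and r.tags.get("status") == "stable"


chosen = registry.latest(stable_code)
```

An app could query several such registries and compose the returned snapshots into its own namespace.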

P4: What about potential conflicts among different snapshots?

Unrelated Apps
Work on different portions of the namespace
[Figure: a namespace tree with / at the root, climate and ocean below it, and pacific and atlantic as leaves; App 1 and App 2 use different subtrees]
Won’t generate any conflicts

Workflow Apps
Access the same dataset at different times
[Figure: a job scheduler / workflow engine drives Mapper, Reducer, Iter3, and Iter4 stages over the datasets login_log, user_profile, and movie_profile under /]
Won’t generate any conflicts

Self-Coordinating Apps
Coded to be conflict-free
[Figure: a parallel scientific app whose processes P1, P2, P3 coordinate over MPI and write one File]
Won’t generate any conflicts

Namespace composition is fast if there is no conflict
• Recall: near-zero cost of merging logs
  o better if those logs do not overlap with each other
What if there are conflicts?

Use domain knowledge
Conflicts resolved per app’s own reconciliation policy
[Figure: two input snapshots, each containing /deltafs with file_1 and file_2, and possible resolution outcomes under /deltafs, labeled file_1, file_2, file_1(a), file_2(a), file_1(b), file_2(b)]
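An app-supplied reconciliation policy might be plugged into namespace composition as in the following sketch. The `compose` function, the two example policies, and the flat path-to-object dictionaries are assumptions made for illustration; the slides only state that each app resolves conflicts according to its own policy.

```python
from typing import Callable, Dict

Namespace = Dict[str, str]  # path -> object id (simplified, assumption)
Policy = Callable[[str, str, str], Namespace]


def keep_both(path: str, a: str, b: str) -> Namespace:
    """Keep both versions under distinct names, e.g. file_1(a), file_1(b)."""
    return {f"{path}(a)": a, f"{path}(b)": b}


def prefer_second(path: str, a: str, b: str) -> Namespace:
    """Let the second snapshot win."""
    return {path: b}


def compose(snap_a: Namespace, snap_b: Namespace, policy: Policy) -> Namespace:
    """Merge two snapshots; the policy is consulted only when both
    snapshots name the same path with different contents."""
    merged: Namespace = {}
    for path in sorted(snap_a.keys() | snap_b.keys()):
        in_a, in_b = path in snap_a, path in snap_b
        if in_a and in_b and snap_a[path] != snap_b[path]:
            merged.update(policy(path, snap_a[path], snap_b[path]))
        elif in_a:
            merged[path] = snap_a[path]
        else:
            merged[path] = snap_b[path]
    return merged


snap_a = {"/deltafs/file_1": "a1", "/deltafs/file_2": "a2"}
snap_b = {"/deltafs/file_1": "b1", "/deltafs/file_2": "b2"}
both = compose(snap_a, snap_b, keep_both)
second = compose(snap_a, snap_b, prefer_second)
```

Non-conflicting paths never reach the policy, which matches the observation that composition is cheap when logs do not overlap.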

Use curators to remember conflict resolution results
So no duplicated resolutions by different apps
[Figure: a namespace curator inherits a pre-resolved namespace from an app; another namespace curator; an app directly takes namespaces from 2 curators]

Conclusion
• Strong scalability needs strong decoupling
  o existing clients synchronize too often with servers
  o removing servers forces us to rethink what is necessary
  o need to try a radically different model for shared storage
