What’s Beyond IndexFS & BatchFS: Envisioning a Parallel File System without Dedicated Metadata Servers
Qing Zheng, Kai Ren, Garth Gibson, Bradley W. Settlemyer, Gary Grider
Carnegie Mellon University, Los Alamos National Laboratory
Scaling needs decoupling
• NASD [asplos98]
o decoupling data from metadata
o Lustre, Google FS, etc
• IndexFS [sc14]
o dynamically partitioned metadata middleware
o orders of magnitude faster than Lustre in metadata
[Figure: Throughput (Kop/s) for empty file creation, file lookup, and file deletion; IndexFS_Lustre (32 clients run IndexFS) vs. Lustre (single server, 32 clients); annotated speedups: 30x, 300x, 100x]
Exa-scaling demands ever more decoupling
Parallel Data Lab - http://www.pdl.cmu.edu/ PDSW 2015
Compute-side server code
• BatchFS [pdsw14]
o decoupling clients from servers
o temporarily scale beyond the total number of servers
o very fast for a while and eventually clients communicate with servers to merge updates
[Figure: File Creates (Kop/s), 16 servers, 64 clients; IndexFS: 618, BatchFS: 19,692 (30X)]
How much further can we delay & decouple merging?
∆FS Goal
• Want the peak throughput BatchFS demonstrated
• Compel freedom from server synchronization
o by eliminating all server machines
o by dealing with issues arising from the absence of metadata servers
o by not assuming an underlying PFS
Scale beyond BatchFS
Agenda
• DeltaFS design
• Why having no dedicated servers is not a problem
Middleware Design
∆FS is middleware spawned by each parallel app
[Figure: several apps, each with processes P1, P2, P3, …, Pn linked with ∆FS, on top of an object store storing data/metadata]
∆FS Overview
FS defined by a set of snapshots, stored as sets of metadata logs and data objects
[Figure: an FS snapshot. Logical view: a directory tree (/, a, b, c, d, e). Object storage: a set of logs, each a list of metadata ops (e.g. rename /d->/e, rmdir /c)]
Note: data objects not shown here
System Model
Reads input dataset from an existing FS snapshot
Creates a new snapshot with output data inserted
[Figure: an app takes an input snapshot produced by a previous app (logical view over logs in object storage) and creates a new snapshot ready to be used by future apps]
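The system model above can be pictured with a small sketch. It assumes, purely for illustration, that a snapshot is an ordered tuple of log object names; the `Snapshot` class, `run_app` function, and the log names are hypothetical and not part of DeltaFS.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical model: a snapshot is an ordered tuple of log names (oldest first)."""
    logs: Tuple[str, ...]


def run_app(input_snapshot: Snapshot, new_log: str) -> Snapshot:
    """Return a new snapshot: the input logs plus the log this app produced.

    The input snapshot is immutable, so earlier snapshots stay usable.
    """
    return Snapshot(logs=input_snapshot.logs + (new_log,))


previous = Snapshot(logs=("log-001", "log-002"))  # produced by a previous app
output = run_app(previous, "log-003")             # ready to be used by future apps
print(output.logs)
```

The point of the sketch is that producing a new snapshot never rewrites the input: the old snapshot and the new one simply share the older logs.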
Key take-away
• NO global namespace
Each namespace is defined by the app and the logs loaded by it
• NO false sharing
Apps don’t access logs not needed by them
• NO dedicated metadata servers
App directly communicates with the storage to load/dump metadata logs
How are logs implemented?
• TableFS [atc13]
o namespace = a large dir entry table + embedded inodes
• implemented as an LSM-Tree (a collection of ordered B-Trees)
• Each log object is a differential B-Tree (diff)
o representing a set of recent updates (e.g. newly inserted/modified inodes)
[Figure: a log drawn as a sorted collection of k/v pairs]
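A minimal sketch of the "dir entry table + embedded inodes" idea, assuming a hypothetical key layout of (parent directory id, entry name) and a plain sorted list standing in for a differential B-Tree. None of these names come from TableFS or DeltaFS.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str]             # (parent directory id, entry name): assumed key layout
Inode = Optional[Dict[str, int]]  # embedded inode attributes; None would mark a deletion


def build_diff(updates: Dict[Key, Inode]) -> List[Tuple[Key, Inode]]:
    """Freeze a batch of recent updates into one immutable, key-sorted log object."""
    return sorted(updates.items(), key=lambda kv: kv[0])


ROOT = 0
recent = {
    (ROOT, "climate"): {"id": 1, "mode": 0o755},
    (1, "pacific"): {"id": 3, "mode": 0o644},
    (ROOT, "ocean"): {"id": 2, "mode": 0o755},
}
diff = build_diff(recent)
for key, inode in diff:
    print(key, inode)
```

Because the keys sort by parent directory first, all entries of one directory end up adjacent in the diff, which is what makes directory scans cheap in this kind of layout.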
Why is an LSM-Tree a good idea?
• Logs are first-class data
No need to replay logs to recover namespaces
Near-zero cost of merging namespaces
• Each log is self-indexed
Scanning/reading within a single log is fast: O(log N)
Scanning/reading a series of non-overlapping logs is as fast as a single log
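The two properties above can be sketched as follows, assuming each log is a key-sorted list of (path, value) pairs and a namespace is a list of such logs. The function names and the sample paths are illustrative assumptions, not the DeltaFS implementation.

```python
import bisect
from typing import List, Optional, Tuple

Log = List[Tuple[str, str]]  # key-sorted (path, value) pairs; stand-in for one log object


def lookup_in_log(log: Log, key: str) -> Optional[str]:
    """Binary search inside one self-indexed log: O(log N) comparisons."""
    # (key,) sorts just before any (key, value) pair, so this lands on the match.
    i = bisect.bisect_left(log, (key,))
    if i < len(log) and log[i][0] == key:
        return log[i][1]
    return None


def lookup(namespace: List[Log], key: str) -> Optional[str]:
    """Check logs newest first; the first hit wins, so nothing is replayed."""
    for log in reversed(namespace):
        value = lookup_in_log(log, key)
        if value is not None:
            return value
    return None


log_a = [("/climate/atlantic", "inode-1"), ("/climate/pacific", "inode-2")]
log_b = [("/ocean/pacific", "inode-3")]
namespace = [log_a]
namespace = namespace + [log_b]  # "merging" only adds another log to the set
print(lookup(namespace, "/ocean/pacific"))
```

In this sketch, merging two namespaces costs one list concatenation; no entries are copied or rewritten, which is the intuition behind "near-zero cost of merging".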
Agenda
• DeltaFS design
• Why having no dedicated servers is not a problem
P1: Do my apps need the FS to communicate/synchronize?
Unrelated Apps
Work on different datasets and don’t communicate.
[Figure: a namespace tree (/, climate, ocean, pacific, atlantic); App 1 and App 2 work in different subtrees]
Don’t need the FS to communicate
Self-Coordinating Apps
Use middleware to share faster & more efficiently
[Figure: a parallel scientific app whose processes P1, P2, P3 coordinate via MPI and share a file]
Don’t need the FS to communicate
Workflow Apps
Externally coordinated by job schedulers
[Figure: a job scheduler / workflow engine drives Mapper, Reducer, Iter3, and Iter4 over a namespace (/, login_log, user_profile, movie_profile)]
Don’t need the FS to communicate
Anonymous Synchronization
e.g. Two app instances competing for mastership
[Figure: App 1 and App 2 compete through a .LOCK file on Lustre, or through Zookeeper (ZAB), Paxos, Raft]
Turn to a mechanism outside the FS to coordinate
P2: But I often use different programs to access data concurrently!
User requested concurrent sharing
Link to ∆FS middleware and attach to the primary parallel app
[Figure: secondary programs (Mon, Viz) link ∆FS and attach to the ∆FS of the primary app (processes P1, P2, P3, …, Pn)]
P3: Which snapshots to use?
Which snapshots to use?
Option 1: rely on job schedulers to automate namespace propagation
[Figure: a job scheduler / workflow engine passes input=… and output=… to each app (App_1, App_2)]
Which snapshots to use?
Option 2: ask external registries using search predicates
[Figure: (1) App_1 publishes to a snapshot registry; (2) App_2 searches the registry; (3) App_2 collects the matching snapshots]
Finding snapshots is like searching a page using Google
• Possible search predicates
o find latest stable science code for my science
o find latest recommended mesh model and cleaned input data
o find latest vendor recommended HW libraries
• Also, there can be multiple snapshot registries
Allows programmable namespace composition
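The publish/search steps and the predicate idea can be sketched as below. This is only an in-memory illustration; `SnapshotRecord`, `SnapshotRegistry`, the tag names, and the version numbers are all hypothetical and do not describe an actual registry interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SnapshotRecord:
    name: str
    tags: Dict[str, str]  # free-form metadata, e.g. {"kind": "mesh", "status": "stable"}
    version: int


@dataclass
class SnapshotRegistry:
    """Hypothetical registry: apps publish records, later apps search them."""
    records: List[SnapshotRecord] = field(default_factory=list)

    def publish(self, record: SnapshotRecord) -> None:
        self.records.append(record)

    def search(self, predicate: Callable[[SnapshotRecord], bool]) -> List[SnapshotRecord]:
        return [r for r in self.records if predicate(r)]


registry = SnapshotRegistry()
registry.publish(SnapshotRecord("mesh-v1", {"kind": "mesh", "status": "stable"}, 1))
registry.publish(SnapshotRecord("mesh-v2", {"kind": "mesh", "status": "stable"}, 2))
registry.publish(SnapshotRecord("mesh-v3", {"kind": "mesh", "status": "draft"}, 3))

# Predicate in the spirit of "find latest recommended mesh model".
stable_meshes = registry.search(
    lambda r: r.tags.get("kind") == "mesh" and r.tags.get("status") == "stable"
)
latest = max(stable_meshes, key=lambda r: r.version)
print(latest.name)
```

An app could run several such searches, possibly against different registries, and compose the returned snapshots into the namespace it needs.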
P4: What about potential conflicts among different snapshots?
Unrelated Apps
Work on different portions of the namespace
[Figure: a namespace tree (/, climate, ocean, pacific, atlantic); App 1 and App 2 work in different subtrees]
Won’t generate any conflicts
Workflow Apps
Access the same dataset at different times
[Figure: a job scheduler / workflow engine drives Mapper, Reducer, Iter3, and Iter4 over a namespace (/, login_log, user_profile, movie_profile)]
Won’t generate any conflicts
Self-Coordinating Apps
Coded to be conflict-free
[Figure: a parallel scientific app whose processes P1, P2, P3 coordinate via MPI and share a file]
Won’t generate any conflicts
Namespace composition is fast if there is no conflict
• Recall: near-zero cost of merging logs
o better if those logs do not overlap with each other
What if there are conflicts?
Use domain knowledge
Conflicts resolved per app’s own reconciliation policy
[Figure: two input snapshots, each containing /deltafs/file_1 and /deltafs/file_2, and possible resolution outcomes that keep the (a) or (b) version of each file]
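A minimal sketch of composing two snapshots under an app-supplied reconciliation policy. It assumes a namespace can be flattened to a path-to-object mapping; `compose`, `keep_both`, `prefer_second`, and the object ids are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

Namespace = Dict[str, str]  # path -> object id (assumed flat representation)
Policy = Callable[[str, str, str], Dict[str, str]]


def keep_both(path: str, a: str, b: str) -> Dict[str, str]:
    """Policy: keep both versions under suffixed names."""
    return {f"{path}(a)": a, f"{path}(b)": b}


def prefer_second(path: str, a: str, b: str) -> Dict[str, str]:
    """Policy: the second snapshot wins."""
    return {path: b}


def compose(first: Namespace, second: Namespace, policy: Policy) -> Namespace:
    """Merge two namespaces; the policy is consulted only for conflicting paths."""
    merged: Namespace = {}
    for path in sorted(set(first) | set(second)):
        if path in first and path in second and first[path] != second[path]:
            merged.update(policy(path, first[path], second[path]))
        elif path in first:
            merged[path] = first[path]
        else:
            merged[path] = second[path]
    return merged


snap_a = {"/deltafs/file_1": "obj-a1", "/deltafs/file_2": "obj-a2"}
snap_b = {"/deltafs/file_1": "obj-b1", "/deltafs/file_2": "obj-b2"}
print(compose(snap_a, snap_b, keep_both))
print(compose(snap_a, snap_b, prefer_second))
```

Different apps can pass different policies to the same composition step, which is one way to read "conflicts resolved per app's own reconciliation policy".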
Use curators to remember conflict resolution results
So no duplicated resolutions by different apps
[Figure: a curator inherits a pre-resolved namespace from another namespace curator; an app directly takes a namespace from a curator; an app takes namespaces from 2 curators]
Conclusion
• Strong scalability needs strong decoupling
o existing clients sync too often with servers
o removing servers forces us to rethink what is necessary
o need to try a radically different model for shared storage