DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
Lipeng Wang 1, Songgao Ye 2, Baichen Yang 1, Youyou Lu 3, Hequan Zhang 2, Shengen Yan 2, and Qiong Luo 1
1 Hong Kong University of Science and Technology   2 SenseTime Research   3 Tsinghua University
Deep Learning Training (DLT): an important workload on clusters
• Widely deployed in many areas: image classification, object detection, natural language processing, recommender systems
• Data intensive: ImageNet-1K has 1.28 million images; Open Images has 9 million images
• Runs on expensive accelerators, e.g., GPUs
• Training the well-known ResNet-50 model on the ImageNet-1K dataset takes more than 30 hours on a cluster
• Question: how to reduce the total training time?
File size distribution and training time breakdown
• Image size and type distribution on a real-world production cluster: most files are smaller than 128 KB
• Data access time accounts for a significant part of the total training time
• Goal: reduce the data access time!
File access procedure in DLT tasks on computer clusters
• Step 1: get the file list of a dataset; Step 2: shuffle the file list
• In each epoch, the training framework reads batches of shuffled files from the distributed storage and cache, applies image processing (crop, flip, etc.), and runs the forward pass and backward propagation; the file list is re-shuffled before the next epoch (see the sketch below)
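To make this loop concrete, here is a minimal Python sketch of the per-epoch access pattern; `load_and_decode` and `train_step` are hypothetical stand-ins for the framework's data pipeline and training step, not DIESEL or PyTorch APIs.

```python
import random

def load_and_decode(path):
    # Stand-in for reading one image file and applying crop/flip augmentation.
    with open(path, "rb") as f:
        return f.read()

def train_step(batch):
    # Stand-in for one forward pass followed by backward propagation.
    pass

def train(file_list, num_epochs=90, batch_size=256):
    for epoch in range(num_epochs):
        random.shuffle(file_list)  # the whole file list is re-shuffled every epoch
        for i in range(0, len(file_list), batch_size):
            # Each batch triggers many small, randomly ordered file reads.
            batch = [load_and_decode(p) for p in file_list[i:i + batch_size]]
            train_step(batch)
```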
Three problems in existing storage and caching systems
• P1: Large number of small files → metadata access is not scalable on existing systems
• P2: Node failure in the global cache → affects all DLT tasks on a cluster and is slow to recover
• P3: Shuffled file access pattern → slow read speed
Problem 1: metadata access does not scale on existing systems
• Alternative 1: separate metadata servers and data servers (e.g., Lustre, GFS); GPU nodes send metadata accesses (e.g., list names, get size) to dedicated metadata servers
• Alternative 2: distribute metadata and data on all storage servers (e.g., Ceph, GlusterFS)
• Either way, existing storage systems have poor scalability on metadata access
Problem 2: global caching systems are vulnerable to node failures
• Example: task 1 works on dataset 1 (files F1–F9) and task 2 works on dataset 2 (files a–j), with both datasets cached across the same set of nodes
• A node failure in task 1 causes cache misses for both tasks: task 1 misses on F2 and F6, and task 2 misses on b, f, and j
• Cache node recovery takes a long time because it requires many small file reads
Problem 3: shuffled access of small files is slow
• Figures: the file access pattern in DLT tasks, and a read speed comparison across read unit sizes (up to a ~25x gap)
• The shuffled access pattern on small files severely hurts read performance; a micro-benchmark sketch follows below
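A minimal Python sketch of this effect, with illustrative file counts and sizes (not the paper's benchmark configuration); on a warm OS page cache the gap is small, while on a cold cache or a networked filesystem such as Lustre it grows dramatically:

```python
import os
import random
import time

N, SIZE = 1000, 4096  # illustrative: 1000 files of 4 KB each

# Prepare N small files, plus one chunk file holding the same bytes.
os.makedirs("small", exist_ok=True)
with open("chunk.bin", "wb") as chunk:
    for i in range(N):
        data = os.urandom(SIZE)
        with open(f"small/{i}.bin", "wb") as f:
            f.write(data)
        chunk.write(data)

# Shuffled small-file reads: the DLT access pattern.
order = list(range(N))
random.shuffle(order)
t0 = time.perf_counter()
for i in order:
    with open(f"small/{i}.bin", "rb") as f:
        f.read()
t_small = time.perf_counter() - t0

# One large sequential read of the same bytes.
t0 = time.perf_counter()
with open("chunk.bin", "rb") as f:
    f.read()
t_chunk = time.perf_counter() - t0

print(f"shuffled small reads: {t_small:.4f}s, one chunk read: {t_chunk:.4f}s")
```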
Proposed solutions in DIESEL
• P1 (large number of small files; metadata access is not scalable) → Solution 1: distributed in-memory metadata server & metadata snapshot
• P2 (node failure in the global cache affects all DLT tasks; slow to recover) → Solution 2: task-grained caching system
• P3 (shuffled file access pattern; slow read speed) → Solution 3: chunk-based shuffle method
DIESEL overview
• Distributed in-memory key/value server as the metadata server
• Metadata snapshot
• Task-grained distributed cache
• POSIX-compliant interface
The first step: write files into DIESEL
• Files are merged into large chunks
• Metadata is saved in the head of each chunk as well as in an in-memory key/value server (see the sketch below)
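A minimal sketch of this write path under an assumed chunk layout: a length-prefixed pickled header mapping file names to (offset, length), followed by the file bytes. The header encoding and the dict-like `kv_server` are illustration-only assumptions, not DIESEL's actual on-disk format:

```python
import pickle

def write_chunk(chunk_path, files, kv_server):
    """files: list of (name, data_bytes) pairs; kv_server: dict-like metadata store."""
    header, payload, offset = {}, bytearray(), 0
    for name, data in files:
        header[name] = (offset, len(data))  # offsets relative to the payload start
        payload += data
        offset += len(data)
    blob = pickle.dumps(header)
    with open(chunk_path, "wb") as f:
        f.write(len(blob).to_bytes(8, "little"))  # header length prefix
        f.write(blob)                             # metadata in the chunk head
        f.write(payload)                          # file contents back to back
    # Mirror the per-file metadata in the in-memory key/value server.
    for name, (off, length) in header.items():
        kv_server[name] = (chunk_path, off, length)
```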
Metadata storage in DIESEL
• Why store metadata with the data chunks as well?
• The in-memory key/value server may fail: recently written entries can be lost, and a power failure can lose all entries
Reconstruct key/value pairs from data chunks
• Because every chunk carries its own metadata in its head, the key/value pairs can be rebuilt by scanning the chunk heads (see the sketch below)
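Continuing the layout assumed in the write sketch above, recovery reduces to one sequential header read per chunk:

```python
import glob
import pickle

def rebuild_metadata(chunk_pattern, kv_server):
    for chunk_path in glob.glob(chunk_pattern):
        with open(chunk_path, "rb") as f:
            hdr_len = int.from_bytes(f.read(8), "little")
            header = pickle.loads(f.read(hdr_len))  # metadata lives in the chunk head
        for name, (off, length) in header.items():
            kv_server[name] = (chunk_path, off, length)

kv = {}
rebuild_metadata("chunks/*.chunk", kv)  # repopulates the in-memory key/value server
```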
Metadata snapshot – download from DIESEL
• Step 1: get the metadata of a dataset from the key/value server (held in hashmaps) and save it to a disk file on distributed storage (e.g., Lustre)
• Dataset-level metadata: update time, etc.; per-file metadata: chunk ID, offset, length, etc.
Metadata snapshot – load from disks
• Step 2: load the metadata from the disk file
• Step 3: get the dataset's update time from the key/value server and check it against the snapshot's timestamp
Metadata snapshot – bypass the metadata server to retrieve files
• Step 4: look up metadata (chunk ID, offset, length, etc.) locally, bypassing the metadata server, and read the data chunks directly
• A sketch of the whole snapshot flow follows below
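A minimal sketch of the four-step snapshot flow, under assumed formats: a pickled snapshot file on shared storage, and illustrative `kv_server` keys for the update time and file map:

```python
import pickle

def save_snapshot(dataset, kv_server, snapshot_path):
    # Step 1: pull the dataset's metadata from the key/value server and persist it.
    meta = {
        "update_time": kv_server[f"{dataset}:update_time"],
        "files": kv_server[f"{dataset}:files"],  # name -> (chunk_id, offset, length)
    }
    with open(snapshot_path, "wb") as f:
        pickle.dump(meta, f)

def load_snapshot(dataset, kv_server, snapshot_path):
    # Step 2: load the snapshot from the disk file on distributed storage.
    with open(snapshot_path, "rb") as f:
        meta = pickle.load(f)
    # Step 3: check the update timestamp against the key/value server.
    if meta["update_time"] != kv_server[f"{dataset}:update_time"]:
        return None  # stale snapshot: fall back to the metadata server
    # Step 4: every later lookup is a local hashmap hit; the server is bypassed.
    return meta["files"]
```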
Task-grained distributed caching system
• DIESEL deploys a task-grained distributed cache across the GPU nodes of each DLT task (the figure shows tasks A and B, each with caching servers co-located on its own GPU nodes)
• Node failures are isolated to one task
• The number of network connections is reduced
• The cache's lifetime follows the DLT task
• A placement sketch follows below
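One simple way to realize task-grained placement is to hash each chunk onto the task's own GPU nodes only; this hash scheme is an assumption for illustration, as the slides do not specify DIESEL's placement function:

```python
import hashlib

def owner_node(task_nodes, chunk_id):
    """task_nodes: the GPU nodes of this task only, never cluster-wide."""
    h = int(hashlib.md5(chunk_id.encode()).hexdigest(), 16)
    return task_nodes[h % len(task_nodes)]

task_a_nodes = ["node1", "node2"]  # illustrative node lists for two DLT tasks
task_b_nodes = ["node3", "node4"]
print(owner_node(task_a_nodes, "chunk_0042"))  # task A resolves within its own nodes
print(owner_node(task_b_nodes, "chunk_0042"))  # a failure in task A cannot evict this
```

Because no two tasks share cache nodes, a node failure in one task never causes cache misses in another, unlike the global cache in Problem 2.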
Chunk-based shuffle method
• In DLT tasks, the exact file access order does not matter, as long as it is random
• DIESEL generates a shuffled file list that converts individual small-file reads into large chunk reads, with a small memory footprint
• In the figure's example, cache misses occur only on the first three files of each chunk (see the sketch below)
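A minimal sketch of one way to implement such a shuffle; the two-level scheme (shuffle the chunk order, then shuffle files within each chunk) is an assumed reading of the method, chosen because it keeps consecutive reads chunk-local while the resulting file order is still random:

```python
import random

def chunk_based_shuffle(chunks, seed=None):
    """chunks: list of lists, each inner list holding the file names of one chunk."""
    rng = random.Random(seed)
    order = list(range(len(chunks)))
    rng.shuffle(order)                    # randomize the order of chunks
    shuffled = []
    for ci in order:
        files = list(chunks[ci])
        rng.shuffle(files)                # randomize files within each chunk
        shuffled.extend(files)            # consecutive reads stay in one chunk
    return shuffled

chunks = [[f"c{c}_f{i}" for i in range(4)] for c in range(3)]  # toy layout
print(chunk_based_shuffle(chunks, seed=0))
```

With this order, the cache fetches each chunk at most once per epoch, so misses occur only at the start of each chunk.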
Experimental Setup
• Dataset: ImageNet-1K (1.28 million images, ~150 GB)
• Framework: PyTorch
• Models: AlexNet, VGG-11, ResNet-18, ResNet-50
Evaluation on file writing
• DIESEL is faster than Lustre and Memcached on file writing
• For 4 KB files, DIESEL is about 200x faster than Memcached and about 360x faster than Lustre
• For 128 KB files, DIESEL is about 17x faster than Memcached and about 120x faster than Lustre
Evaluation on metadata access and metadata snapshot
• With the metadata snapshot disabled, increasing the number of DIESEL servers increases metadata access performance
• With the metadata snapshot enabled, metadata access throughput increases linearly with the number of workers
• DIESEL has faster metadata query response times than Lustre and XFS-NVME
Evaluation on the task-grained distributed cache
• The task-grained distributed cache achieves better performance than an existing global in-memory caching system
• The task-grained cache's cold-booting time is shorter than Memcached's node recovery time
Evaluation on the chunk-based shuffle method
• The chunk-based shuffle method delivers higher read bandwidth than the Lustre filesystem
• For 4 KB file reads, DIESEL is more than 50x faster than Lustre
• For 128 KB file reads, DIESEL is more than 4x faster than Lustre
Evaluation on real-world DLT tasks
• DIESEL reduces end-to-end training time by about 15%–27%