DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training - PowerPoint PPT Presentation

SLIDE 1

DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training

Lipeng Wang¹, Songgao Ye², Baichen Yang¹, Youyou Lu³, Hequan Zhang², Shengen Yan², and Qiong Luo¹

¹Hong Kong University of Science and Technology  ²SenseTime Research  ³Tsinghua University

SLIDE 2-3

Deep learning training (DLT): an important workload on clusters

  • Widely deployed in many areas:
    • Image classification
    • Object detection
    • Natural language processing
    • Recommender systems
  • Data intensive:
    • ImageNet-1K: 1.28 million images
    • Open Images: 9 million images
  • Runs on expensive accelerators, e.g., GPUs

Training the well-known ResNet-50 model on the ImageNet-1K dataset takes more than 30 hours on a cluster. How can the total training time be reduced?

SLIDE 4-5

File size distribution and training time breakdown

Image size and type distribution on a real-world production cluster:

  • Most files are smaller than 128KB

The data access time takes a significant part of the total training time.

Reduce the data access time!

SLIDE 6-7

File access procedure in DLT tasks on computer clusters

In each epoch, the training framework interacts with the distributed storage and cache as follows (a minimal code sketch follows this list):

  • Get the file list of a dataset
  • Shuffle the file list
  • Read a batch of shuffled files, repeatedly, until the epoch ends
  • Image processing: crop, flip, etc.
  • Forward pass and backward propagation (car? house? cat? horse?)
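
To make the loop concrete, here is a minimal PyTorch sketch of this access pattern. The dataset path and the raw `f.read()` are illustrative stand-ins for real image loading, not DIESEL's API.

```python
import os
from torch.utils.data import Dataset, DataLoader

class SmallFileDataset(Dataset):
    """One small file per sample, as DLT tasks read images."""
    def __init__(self, root):
        # Get the file list of the dataset
        self.paths = sorted(os.path.join(root, name) for name in os.listdir(root))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Each sample is one small, randomly ordered file read;
        # decoding, crop, and flip would follow here.
        with open(self.paths[idx], "rb") as f:
            return f.read()

# shuffle=True re-shuffles the file list at every epoch
loader = DataLoader(SmallFileDataset("/data/imagenet"), batch_size=256, shuffle=True)
for epoch in range(90):
    for batch in loader:   # reads a batch of shuffled files
        pass               # forward pass and backward propagation go here
```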

SLIDE 8

Three problems in existing storage and caching systems

  • P1: Large number of small files. Metadata access is not scalable on existing systems.
  • P2: Node failure in the global cache. A failure affects all DLT tasks on a cluster and is slow to recover from.
  • P3: Shuffled file access pattern. Shuffled reads of small files have slow read speed.

SLIDE 9

Problem 1: metadata access does not scale on existing systems

Existing distributed storage takes one of two forms:

  • Alternative 1: separate metadata servers and data servers (e.g., Lustre, GFS); GPU nodes send metadata accesses (e.g., list names, get sizes) to the dedicated metadata servers
  • Alternative 2: metadata and data distributed across all storage servers (e.g., Ceph, GlusterFS); GPU nodes send metadata accesses to the combined metadata and data servers

Either way, existing storage systems have poor scalability on metadata access.

SLIDE 10-12

Problem 2: global caching systems are vulnerable to node failures

Files from different datasets (F1-F9 of dataset 1 and a-j of dataset 2) are cached across the same set of global cache nodes:

  • Task 1 works on dataset 1
  • Task 2 works on dataset 2
  • A node failure in task 1 will affect both task 1 and task 2! In the example, task 1 gets cache misses on F2 and F6, while task 2 gets cache misses on b, f, and j.
  • The cache node recovery takes a long time due to small file reads!
SLIDE 13

Problem 3: shuffled access of small files is slow

  • The shuffled access pattern on small files severely hurts read performance

[Figures: the file access pattern in DLT tasks, and a read speed comparison across read unit sizes showing roughly a 25x gap]

SLIDE 14

Proposed solutions in DIESEL

  • P1: Large number of small files; metadata access is not scalable on existing systems → distributed in-memory metadata server and metadata snapshot
  • P2: Node failure in the global cache affects all DLT tasks and is slow to recover → task-grained caching system
  • P3: Shuffled file access pattern gives slow read speed → chunk-based shuffle method

SLIDE 15

DIESEL overview

  • Distributed in-memory key/value server as the metadata server
  • Metadata snapshot
  • Task-grained distributed cache
  • POSIX-compliant interface

SLIDE 16

The first step: write files into DIESEL

  • Files are merged into large chunks
  • Metadata is saved at the head of each chunk as well as in an in-memory key/value server (see the sketch after this list)
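
A minimal sketch of this write path, assuming a simple chunk layout (a length-prefixed JSON header followed by the file bytes back to back). The layout, `write_chunk`, and the plain dict standing in for the in-memory key/value server are illustrative assumptions, not DIESEL's actual on-disk format.

```python
import json
import struct

def write_chunk(chunk_path, files, kv_store, chunk_id):
    """Pack small files (name, bytes) into one chunk; record metadata twice."""
    entries, body, offset = [], bytearray(), 0
    for name, data in files:
        entries.append({"name": name, "offset": offset, "length": len(data)})
        body += data
        offset += len(data)
    header = json.dumps(entries).encode()
    with open(chunk_path, "wb") as f:
        f.write(struct.pack("<I", len(header)))  # header length prefix
        f.write(header)                          # metadata at the head of the chunk
        f.write(bytes(body))                     # file contents, back to back
    base = 4 + len(header)                       # data region starts after the header
    for e in entries:                            # mirror metadata into the K/V server
        kv_store[e["name"]] = (chunk_id, base + e["offset"], e["length"])
```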
SLIDE 17-19

Metadata storage in DIESEL

Why store metadata with the data chunks? Because the in-memory key/value server may fail:

  • Recently written entries may be lost
  • All entries may be lost due to a power failure
SLIDE 20

Reconstruct key/value pairs from data chunks

Since each chunk head carries a copy of the metadata, the key/value pairs can be reconstructed by scanning the chunk heads (sketched below).
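
A companion sketch to the write-path sketch on slide 16: rebuild the key/value pairs by scanning each chunk's header. It assumes the same illustrative length-prefixed JSON layout, not DIESEL's actual format.

```python
import json
import struct

def rebuild_kv(chunk_paths):
    """Recover name -> (chunk_id, offset, length) from chunk heads alone."""
    kv = {}
    for chunk_id, path in enumerate(chunk_paths):
        with open(path, "rb") as f:
            (hlen,) = struct.unpack("<I", f.read(4))  # header length prefix
            entries = json.loads(f.read(hlen))        # metadata at the chunk head
        base = 4 + hlen                               # data region starts here
        for e in entries:
            kv[e["name"]] = (chunk_id, base + e["offset"], e["length"])
    return kv
```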

SLIDE 21

Metadata snapshot – download from DIESEL

Step 1: the DIESEL server gets the metadata of a dataset from the K/V server, where key/value pairs live in hashmaps (dataset: update time, etc.; file: ChunkID, offset, length, etc.), and saves the metadata to a disk file on distributed storage (e.g., Lustre).
SLIDE 22

Metadata snapshot – load from disks

Steps 2-3: the DIESEL server loads the metadata from the disk file on distributed storage (e.g., Lustre) and gets the dataset's update time from the K/V server to check that the snapshot is up to date.
SLIDE 23

Metadata snapshot – bypass the metadata server to retrieve files

Step 4: on file access, look up the metadata (ChunkID, offset, length, etc.) locally in the snapshot, bypassing the metadata server, and read the data chunks directly (a combined sketch follows).
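
A minimal sketch of the snapshot flow across slides 21-23, assuming a JSON snapshot file and an update-time freshness check. The file format, function names, and dict-based metadata are illustrative assumptions.

```python
import json

def save_snapshot(kv_entries, update_time, path):
    """Step 1: dump a dataset's metadata to a file on distributed storage."""
    with open(path, "w") as f:
        json.dump({"update_time": update_time, "entries": kv_entries}, f)

def load_snapshot(path, server_update_time):
    """Steps 2-3: load the snapshot and verify it is still current."""
    with open(path) as f:
        snap = json.load(f)
    if snap["update_time"] != server_update_time:
        return None              # stale snapshot: fall back to the K/V server
    return snap["entries"]       # step 4: local lookups now bypass the server

# entries = load_snapshot("/lustre/imagenet.snap", server_update_time)
# chunk_id, offset, length = entries["some_image.JPEG"]   # hypothetical key
```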
SLIDE 24

Task-grained distributed caching system

DIESEL deploys a task-grained distributed cache across the GPU nodes of a DLT task: each training task (e.g., task A, task B) runs caching servers only on its own GPU nodes.

  • Isolates node failures to a single task
  • Reduces the number of network connections
  • The cache's lifetime follows the DLT task (see the placement sketch below)
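
A minimal sketch of task-grained placement, assuming files are mapped to cache servers by hashing over only the task's own GPU nodes; DIESEL's actual placement policy is not detailed here, so `cache_node` and the node names are illustrative.

```python
import hashlib

def cache_node(filename, task_nodes):
    """Pick a cache server from this task's nodes only, so another
    task's nodes (and their failures) never get involved."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return task_nodes[h % len(task_nodes)]

task_a_nodes = ["gpu-node-1", "gpu-node-2"]   # task A's GPU nodes
task_b_nodes = ["gpu-node-3", "gpu-node-4"]   # task B's GPU nodes
print(cache_node("F2", task_a_nodes))         # a dataset-1 file stays within task A
print(cache_node("b", task_b_nodes))          # a dataset-2 file stays within task B
```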
SLIDE 25

Chunk-based shuffle method

  • In DLT tasks, the exact file access order does not matter, as long as it is random
  • DIESEL therefore generates a shuffled file list that converts individual file reads into large chunk reads (see the sketch after this list)
  • Small memory footprint
  • In the example, cache misses occur only on the first three files
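
A minimal sketch of a chunk-based shuffle, assuming the file-to-chunk grouping from the write path: shuffle the chunk order and the file order within each chunk, so the global order stays random while reads proceed one large chunk at a time.

```python
import random

def chunk_based_shuffle(chunks, seed=None):
    """chunks: dict mapping chunk_id -> list of file names in that chunk."""
    rng = random.Random(seed)
    chunk_order = list(chunks)
    rng.shuffle(chunk_order)           # randomize the order of chunks
    shuffled = []
    for cid in chunk_order:
        files = list(chunks[cid])
        rng.shuffle(files)             # randomize the order within each chunk
        shuffled.extend(files)         # one sequential chunk read serves all of these
    return shuffled

example = {0: ["a", "b", "c"], 1: ["d", "e", "f"], 2: ["g", "h", "i"]}
print(chunk_based_shuffle(example, seed=42))
```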

SLIDE 26

Experimental Setup

Dataset:
  • ImageNet-1K (1.28 million images, ~150GB)

Framework:
  • PyTorch

Models:
  • AlexNet
  • VGG-11
  • ResNet-18
  • ResNet-50
SLIDE 27

Evaluation on file writing

  • DIESEL is faster than Lustre and Memcached on file writing
  • On 4KB files, DIESEL is about 200x faster than Memcached and about 360x faster than Lustre
  • On 128KB files, DIESEL is about 17x faster than Memcached and about 120x faster than Lustre

SLIDE 28

Evaluation on metadata access and metadata snapshot

  • With the metadata snapshot disabled, adding DIESEL servers increases metadata access performance
  • With the metadata snapshot enabled, metadata access throughput increases linearly with the number of workers
  • DIESEL has a shorter metadata query response time than Lustre and XFS-NVME

SLIDE 29

Evaluation on the task-grained distributed cache

  • The task-grained distributed cache achieves better performance than an existing global in-memory caching system
  • The task-grained cache's "cold-booting" time is shorter than Memcached's node recovery time

SLIDE 30

Evaluation on the chunk-based shuffle method

  • The chunk-based shuffle method achieves higher read bandwidth than the Lustre filesystem
  • On 4KB file reads, DIESEL is more than 50x faster than Lustre
  • On 128KB file reads, DIESEL is more than 4x faster than Lustre

SLIDE 31

Evaluation on real-world DLT tasks

  • DIESEL reduces end-to-end training time by about 15%-27%

SLIDE 32

Summary

DIESEL is a storage and caching system co-designed for DLT tasks:

  • Efficient metadata management: a distributed in-memory key/value database plus a metadata snapshot mechanism
  • A task-grained distributed caching system that isolates node failures
  • A chunk-based shuffle method that converts shuffled small-file reads into large chunk reads
  • Demonstrated efficiency on real-world DLT tasks
SLIDE 33

Q & A

Our Research Group: Rapids@HKUST
https://github.com/RapidsAtHKUST