Easy and Instantaneous Processing for Data-Intensive Workflows
Nan Dun, Kenjiro Taura, and Akinori Yonezawa
Graduate School of Information Science and Technology, The University of Tokyo
Contact: dunnan@yl.is.s.u-tokyo.ac.jp
MTAGS 2010, New Orleans, USA, November 15th, 2010
Background
✤ More computing resources
  ✤ Desktops, clusters, clouds, and supercomputers
✤ More data-intensive applications
  ✤ Astronomy, bio-genome, medical science, etc.
✤ More domain researchers using distributed computing
  ✤ They know their workflows well, but have little system knowledge
Motivation
✤ Domain researchers
  ✤ Able to use resources provided by universities, institutions, etc.
  ✤ Know their applications well
  ✤ Know little about systems, esp. distributed systems
  ✤ "Would you please help me run my applications?"
✤ System people
  ✤ Only able to control the resources they administrate
  ✤ Know systems well
  ✤ Know little about domain applications
  ✤ "OK, but teach me about your applications first."
✤ Our answer: "Actually, you can do it by yourself, on any machines!"
Outline
✤ Brief description of our processing framework
  ✤ GXP parallel/distributed shell
  ✤ GMount distributed file system
  ✤ GXP Make workflow engine
✤ Experiments
  ✤ Practices on clusters and a supercomputer
  ✤ Viewed from the underlying data sharing, since our target is data-intensive workflows
Usage: How simple it is!
1. Write the workflow description in a makefile (a minimal sketch follows this slide)
2. Resource exploration (start from one single node!)
    $ gxpc use ssh clusterA clusterB
    $ gxpc explore clusterA[[000-020]] clusterB[[100-200]]
3. Deploy the distributed file system
    $ gmnt /export/on/each/node /file/system/mountpoint
4. Run the workflow
    $ gxpc make -f makefile -j N_parallelism
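As an illustration of step 1, here is a minimal sketch of what such a makefile could look like; the file names and the do_analysis command are hypothetical placeholders, not taken from the talk.

    # Hypothetical workflow makefile: build out.dat from in.dat on the shared file system
    out.dat: in.dat
        do_analysis in.dat > out.dat   # any ordinary command line; GXP Make decides which node runs it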
GXP Parallel/Distributed Shell
✤ Install and start from one single node
✤ Implemented in Python, no compilation needed
✤ Supports various login channels, e.g. SSH, RSH, TORQUE
✤ Efficiently issues commands and invokes processes on many nodes in parallel
[Diagram: "GXP shell magic": $ gxpc e ls fans out over SSH/RSH/TORQUE and runs ls on every node]
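For example, once nodes have been explored, a single gxpc e line runs a command everywhere. This is a usage sketch based on the $ gxpc e ls command in the diagram, with hostname used purely as an illustrative command.

    $ gxpc use ssh clusterA
    $ gxpc explore clusterA[[000-020]]
    $ gxpc e hostname    # executed in parallel on every explored node; output is gathered back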
GMount Distributed File System
✤ Building block: SSHFS-MUX (sshfsm)
  ✤ Mounts multiple remote directories onto one local directory
  ✤ SFTP protocol over an SSH or socket channel
✤ Parallel mount
  ✤ Uses the GXP shell to execute sshfsm on every node (the $ gmnt command)
[Diagram: sshfsm on node A mounting directories exported by nodes A, B, C, and D]
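To make "mount multiple remote directories to local one" concrete, here is a hedged sketch of a manual SSHFS-MUX invocation; it assumes sshfsm keeps sshfs-style host:directory arguments and simply accepts several of them, and the node names and paths are illustrative (in practice gmnt issues these mounts for you).

    # Assumed sshfs-style syntax: merge /export from three nodes under one local mountpoint
    $ sshfsm nodeA:/export nodeB:/export nodeC:/export /mnt/gmount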
GMount (Cont.)
✤ GMount features, esp. for wide-area environments
  ✤ No centralized servers
  ✤ Locality-aware file lookup
    ✤ Efficient when the application has access locality
    ✤ New files are created locally
[Diagram: four clusters A, B, C, and D connected over wide-area links]
GXP Make Workflow Engine
✤ Fully compatible with GNU Make
  ✤ Straightforward to write data-oriented applications
✤ Integrated into the GXP shell ($ gxpc make)
✤ Practical dispatching throughput in the wide area
  ✤ 62 tasks/sec on InTrigger vs. 56 tasks/sec by Swift+Falkon on TeraGrid
Makefile fragment from the slide:
    out.dat: in.dat
        run a.job
        run b.job
        run c.job
        run d.job
[Diagram: jobs a.job-d.job dispatched to nodes A-D, whose data is shared by GMount]
GXP Make (Cont.)
✤ Why is Make good?
  ✤ Straightforward to write data-oriented applications
  ✤ Expressive: embarrassingly parallel, MapReduce, etc. (see the sketch after this slide)
  ✤ Fault tolerance: resume from the failure point
  ✤ Easy to debug: the "make -n" option
  ✤ Concurrency control: the "make -j" option
  ✤ Widely used and thus easy to learn
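As one hedged illustration of the "expressive" point, an embarrassingly parallel fan-out can be written with an ordinary GNU Make pattern rule; the data/ file layout and the process_one command below are hypothetical, not from the talk.

    # Hypothetical: turn every *.in file into a *.out file, one independent task per file
    INPUTS  := $(wildcard data/*.in)
    OUTPUTS := $(INPUTS:.in=.out)

    all: $(OUTPUTS)

    %.out: %.in
        process_one $< > $@    # tasks are independent, so gxpc make -j N runs them in parallel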
Evaluation
✤ Experimental environments
  ✤ InTrigger multi-cluster platform
    ✤ 16 clusters, 400 nodes, 1600 CPU cores
    ✤ Connected by heterogeneous wide-area links
  ✤ HA8000 cluster system
    ✤ 512 nodes, 8192 cores
    ✤ Tightly coupled network
Evaluation
✤ Benchmark
  ✤ ParaMark: a parallel metadata and I/O benchmark
✤ Real-world application
  ✤ Event recognition from the PubMed database
[Pipeline diagram: Medline XML → Text Extraction → Enju Parser, Sagae's Dependency Parser, Protein Name Recognizer → Event Recognizer → Event Structure]
Task Characteristics
[Chart: per-input-file processing time (sec, up to ~3,000) and file size (KB, up to ~90) across the input files]
Experiments (Cont.)
✤ Comparison with two other data-sharing approaches (a command-level sketch of the all-to-one mount follows this slide)
  ✤ Gfarm distributed file system: a metadata server (MDS) plus distributed data storage servers (DSS)
  ✤ SSHFS-MUX all-to-one mount: every client mounts one directory exported from the master site (NFS/RAID storage)
[Diagrams: Gfarm clients served by the MDS and DSS nodes; all-to-one mount with every node mounting the master site]
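For contrast with GMount's per-node mounts, the all-to-one alternative might be set up roughly as below. This is only a sketch: it assumes sshfs-style host:directory arguments, master and the paths are hypothetical names, and it relies on the GXP shell to issue the mount on every worker.

    # Sketch: every worker mounts the same exported directory from the master site
    $ gxpc e sshfsm master:/export /mnt/shared    # assumes /mnt/shared already exists on each node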
Summary of Data Sharing

    Operations          | NFS            | Gfarm                   | SSHFS-MUX All-to-One | GMount
    Metadata operations | Central server | Central metadata server | Central server       | Locality-aware
    Data I/O            | Central server | Data servers            | Central server       | Data servers
Transfer Rate in WAN
[Chart: throughput (MB/sec, 0 to 20) vs. block size (4 KB to 16 MB) for Gfarm, SSHFSM (direct), SSHFS, and Iperf]
Workflow in LAN
✤ Single cluster using NFS and SSHFSM All-to-One; 158 tasks, 72 workers
[Chart: parallelism (up to ~85) vs. execution time (0 to ~3,000 sec) for NFS and SSHFSM All-to-One]
Workflow in WAN
✤ 11 clusters using SSHFSM All-to-One; 821 tasks, 584 workers
✤ Long jobs dominate the execution time
[Chart: parallelism (up to ~600) of all jobs vs. long jobs over execution time (0 to ~5,000 sec)]
GMount vs. Gfarm
[Charts: aggregate metadata performance (ops/sec) and aggregate I/O performance (MB/sec) vs. number of concurrent clients (2 to 16); metadata compares Gfarm in LAN, Gfarm in WAN, and GMount in WAN; I/O compares Gfarm and GMount read and write]
GMount vs. Gfarm (Cont.)
✤ 4 clusters using Gfarm and GMount; 159 tasks, 252 workers
✤ Many new files are created; GMount gives a 15% speedup
[Chart: parallelism (up to ~175) vs. execution time (0 to ~3,500 sec) for GMount and Gfarm]
GMount vs. Gfarm (Cont.) 1,000 “Create” Jobs on GMount Elasped Time (second) “Create” Jobs on Gfarm 100 10 1 0.1 0.01 Small Jobs only Create New Empty Files on “/” directory 21
On Supercomputer
[Diagram: the hongo cluster hosting the GXP Make master; ha8000 supercomputer nodes with a Lustre FS; cloko cluster (clokoxxx) nodes; gateways between them; data shared via an sshfs mount of the hongo cluster]
On Supercomputer (Cont.)
✤ Allocated a 6-hour time slot
✤ External workers appended during the run
[Chart: y-axis 0 to 9,000 over x-axis 0 to ~50,000]
Conclusion
✤ GXP shell + GMount + GXP Make
  ✤ Simplicity: no stack of middleware required, wide compatibility
  ✤ Easiness: effortless and rapid to build
  ✤ User-level: no privileges required, usable by any user
  ✤ Adaptability: uniform interface for clusters, clouds, and supercomputers
  ✤ Scalability: scales to hundreds of nodes
  ✤ Performance: high throughput in wide-area environments
Future Work
✤ Improve GMount for better create performance
✤ Improve GXP Make for better, smarter scheduling
✤ Better user interface: configuration ---> workflow execution
✤ Further reduce installation cost
  ✤ Implement SSHFS-MUX in Python
Open Source Software
✤ SSHFS-MUX / GMount: http://sshfsmux.googlecode.com/
✤ GXP parallel/distributed shell: http://gxp.sourceforge.net/
✤ ParaMark: http://paramark.googlecode.com/
Questions?