SLIDE 1

November 15th, 2010 MTAGS 2010 New Orleans, USA

Easy and Instantaneous Processing for Data-Intensive Workflows

Nan Dun, Kenjiro Taura, and Akinori Yonezawa
Graduate School of Information Science and Technology, The University of Tokyo
Contact Email: dunnan@yl.is.s.u-tokyo.ac.jp

SLIDE 2

Background

✤ More computing resources
  ✤ Desktops, clusters, clouds, and supercomputers
✤ More data-intensive applications
  ✤ Astronomy, bio-genomics, medical science, etc.
✤ More domain researchers using distributed computing
  ✤ They know their workflows well, but have little systems knowledge

SLIDE 3

Motivation


Domain Researchers
✤ Able to use resources provided by universities, institutions, etc.
✤ Know their apps well
✤ Know little about systems, esp. distributed systems

System People
✤ Only able to control the resources they administrate
✤ Know the system well
✤ Know little about domain apps

"Would you please help me run my applications?"
"OK, but teach me about your applications first."
"Actually, you can do it by yourself, on any machines!"
SLIDE 4

Outline

✤ Brief description of our processing framework
  ✤ GXP parallel/distributed shell
  ✤ GMount distributed file system
  ✤ GXP Make workflow engine
✤ Experiments
  ✤ Practices on clusters and a supercomputer
  ✤ From the viewpoint of the underlying data sharing, since our target is data-intensive!

SLIDE 5

Usage: How simple it is!

1. Write workflow description in a makefile (a minimal sketch follows this list)
2. Resource exploration (start from one single node!)
   $ gxpc use ssh clusterA clusterB
   $ gxpc explore clusterA[[000-020]] clusterB[[100-200]]
3. Deploy distributed file system
   $ gmnt /export/on/each/node /file/system/mountpoint
4. Run workflow
   $ gxpc make -f makefile -j N_parallelism
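A minimal sketch for step 1, assuming an embarrassingly parallel workflow; the input names and the ./process command are illustrative, not from the talk (recipe lines must be tab-indented in a real makefile):

    # hypothetical workflow: process every in.*.dat independently, then merge
    INPUTS  := $(wildcard in.*.dat)
    OUTPUTS := $(INPUTS:in.%.dat=out.%.dat)

    all: result.dat

    # one rule instance per input file; the -j flag runs them in parallel
    out.%.dat: in.%.dat
            ./process $< > $@

    result.dat: $(OUTPUTS)
            cat $^ > $@

Running step 4 ("$ gxpc make -f makefile -j N_parallelism") then dispatches the independent out.%.dat tasks across the acquired nodes.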

SLIDE 6

GXP Parallel/Distributed Shell

✤ GXP shell magic
  ✤ Install and start from one single node
  ✤ Implemented in Python, no compilation
  ✤ Supports various login channels, e.g. SSH, RSH, TORQUE
  ✤ Efficiently issues commands and invokes processes on many nodes in parallel


[Figure: GXP acquires nodes via SSH, RSH, or TORQUE; "$ gxpc e ls" runs ls on every acquired node in parallel]
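As a small illustration of such a session (the cluster name and node range are hypothetical):

    $ gxpc use ssh clusterA             # log in to clusterA nodes via SSH
    $ gxpc explore clusterA[[000-003]]  # acquire nodes clusterA000..clusterA003
    $ gxpc e hostname                   # run hostname on all acquired nodes in parallel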

SLIDE 7

GMount Distributed File System

✤ Building block: SSHFS-MUX
  ✤ Mounts multiple remote directories onto one local directory
  ✤ SFTP protocol over SSH/socket channel
✤ Parallel mount
  ✤ Uses GXP shell to execute sshfsm on every node


[Figure: "$ gmnt" on four nodes A-D: node A runs sshfsm to mount B, C, and D, while B, C, and D each run sshfsm to mount A]
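For reference, a single SSHFS-MUX invocation might look as sketched below, assuming it takes sshfs-style host:directory arguments as in the figure; the export paths and mountpoint are made up, and gmnt issues such mounts on every node automatically:

    # assumed syntax: merge /export from nodes B, C, and D under one local mountpoint
    $ sshfsm B:/export C:/export D:/export /mnt/gmount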

SLIDE 8

GMount (Cont.)


✤ GMount features, esp. for wide-area environments
  ✤ No centralized servers
  ✤ Locality-aware file lookup
  ✤ Efficient when the application has access locality
  ✤ New files created locally

[Figure: GMount spanning clusters A, B, C, and D over wide-area links]

SLIDE 9


GXP Make Workflow Engine

✤ Fully compatible with GNU Make
✤ Straightforward to write data-oriented applications
✤ Integrated in GXP shell
✤ Practical dispatching throughput in wide-area environments
  ✤ 62 tasks/sec on InTrigger vs. 56 tasks/sec by Swift+Falkon on TeraGrid


[Figure: "$ gxpc make" reads rules such as "out.dat: in.dat" and dispatches the jobs a.job, b.job, c.job, and d.job to nodes A-D, which share files through GMount]

SLIDE 10

GXP Make (Cont.)

✤ Why is Make good?
  ✤ Straightforward to write data-oriented applications
  ✤ Expressive: embarrassingly parallel, MapReduce, etc. (see the sketch after this list)
  ✤ Fault tolerance: continue at the failure point
  ✤ Easy to debug: "make -n" option
  ✤ Concurrency control: "make -j" option
  ✤ Widely used and thus easy to learn
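To illustrate the MapReduce point, here is a sketch using ordinary pattern rules; the chunk names and the ./map and ./reduce commands are placeholders, not the talk's application:

    # map each chunk independently, then reduce all map outputs
    CHUNKS := $(wildcard chunk.*.txt)
    MAPPED := $(CHUNKS:chunk.%.txt=map.%.out)

    result.out: $(MAPPED)
            ./reduce $^ > $@   # reduce phase

    map.%.out: chunk.%.txt
            ./map $< > $@      # map phase, parallelized by "make -j"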

SLIDE 11

Evaluation


✤ Experimental environments
  ✤ InTrigger multi-cluster platform
    ✤ 16 clusters, 400 nodes, 1600 CPU cores
    ✤ Connected by heterogeneous wide-area links
  ✤ HA8000 cluster system
    ✤ 512 nodes, 8192 cores
    ✤ Highly coupled network

SLIDE 12

Evaluation


✤ Benchmark
  ✤ ParaMark: parallel metadata I/O benchmark
✤ Real-world application
  ✤ Event recognition from the PubMed database

[Figure: application pipeline: Medline XML → Text Extraction → Protein Name Recognizer → Enju Parser → Sagae's Dependency Parser → Event Recognizer → Event Structure]

SLIDE 13

Task Characteristics


[Figure: processing time (sec) and file size (KB) for each input file]

SLIDE 14

Experiments (Cont.)


✤ Comparison with two other data-sharing approaches

[Figure: alternative sharing setups: NFS served from a RAID at the master site, the Gfarm distributed file system with a metadata server (MDS) and data storage servers (DSS), and SSHFS-MUX all-to-one mount]

SLIDE 15

Summary of Data Sharing


Operations    | NFS            | SSHFS-MUX All-to-One | Gfarm            | GMount
Metadata I/O  | Central server | Central server       | Metadata servers | Locality-aware
Data I/O      | Central server | Central server       | Data servers     | Data servers

SLIDE 16

Transfer Rate in WAN


[Figure: throughput (MB/sec) vs. block size (4 KB to 16 MB) for Gfarm, SSHFSM (direct), SSHFS, and iperf]

SLIDE 17

Workflow in LAN

✤ Single cluster using NFS and SSHFSM All-to-One, 158 tasks, 72 workers


[Figure: execution time (sec) vs. parallelism for NFS and SSHFSM All-to-One]

SLIDE 18

Workflow in WAN

✤ 11 clusters using SSHFSM All-to-One, 821 tasks, 584 workers


[Figure: execution time (sec) vs. parallelism for all jobs and for long jobs]
Long jobs dominate the execution time.

SLIDE 19

GMount vs. Gfarm


[Figure: aggregate metadata performance (ops/sec) and aggregate I/O performance (MB/sec) vs. number of concurrent clients (2-16); metadata compares Gfarm in LAN, Gfarm in WAN, and GMount in WAN; I/O compares Gfarm and GMount reads and writes]

SLIDE 20

GMount vs. Gfarm (Cont.)

✤ 4 clusters using Gfarm and GMount, 159 tasks, 252 workers


[Figure: execution time (sec) vs. parallelism for GMount and Gfarm; GMount achieves a 15% speedup; many new files are created in this workflow]

SLIDE 21

GMount vs. Gfarm (Cont.)


[Figure: elapsed time (seconds, 0.01 to 1,000, log scale) of small jobs that only create new empty files in the "/" directory: "create" jobs on GMount vs. on Gfarm]

SLIDE 22


On Supercomputer


[Figure: supercomputer setup: HA8000 nodes sharing a Lustre FS, gateway nodes, external cloko and hongo cluster nodes, the GXP Make master, and sshfs mounts connecting them]

SLIDE 23

On Supercomputer (Cont.)


[Figure: execution on the supercomputer over time]
✤ Allocated a 6-hour time slot
✤ External workers appended
SLIDE 24

Conclusion

✤ GXP shell + GMount + GXP Make
  ✤ Simplicity: no stack of middleware needed, wide compatibility
  ✤ Easiness: built effortlessly and rapidly
  ✤ User-level: no privileges required, usable by any user
  ✤ Adaptability: uniform interface for clusters, clouds, and supercomputers
  ✤ Scalability: scales to hundreds of nodes
  ✤ Performance: high throughput in wide-area environments

SLIDE 25

Future Work

✤ Improve GMount for better create performance
✤ Improve GXP Make for better and smarter scheduling
✤ Better user interface: configuration → workflow execution
✤ Further reduce installation cost
✤ Implement SSHFS-MUX in Python

SLIDE 26

Open Source Software

✤ SSHFS-MUX/GMount: http://sshfsmux.googlecode.com/
✤ GXP parallel/distributed shell: http://gxp.sourceforge.net/
✤ ParaMark: http://paramark.googlecode.com/

SLIDE 27

Questions?