  1. Abdulrahman Azab abdulrahman.azab@uis.no

  2. What is Grid? “Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.” Ian Foster & Carl Kesselman, 2001. [Diagram: virtual organizations VO1–VO3 with brokers B1–B3]

  3. What is Cloud?

  4. The KISS Rule: Keep it simple, stupid!

  5. Grid vs Cloud • Grid [Cartoon] User (Ali): “I need a Scientific Linux machine with 2 GB RAM!” Resource (Hero): “I have Scientific Linux with 3 GB RAM.” Manager(s): “Take Hero.”

  6. Grid vs Cloud • Cloud [Cartoon] User: “I need 3 high-CPU Windows machines for 2 weeks.” Provider: “Available for $1000.”

  7. Computational Grid vs. Computational Cloud
     Provided service:              computational power (both)
     Amount of concurrent requests: limited (Grid) / massive (Cloud)
     Transparency:                  not required (Grid) / required (Cloud)
     Scalability:                   limited (Grid) / high (Cloud)
     “I don’t care. Both are distributed computing.” [Diagram: VO1–VO3]

  8. Challenges • Many, but we consider: 1. Stability with scalability 2. System transparency

  9. Stability with Scalability • Stability: maintaining throughput under failures • Scalability: the ability to add more nodes • Stability with scalability: maintaining throughput under failures in a growing environment - achieve load balancing - avoid job starvation

  10. How? • Optimized machine organization • Efficient job scheduling • Efficient fault tolerance

  11. Machine organization • Flat (gLite, Condor, Globus, …) [Diagram: manager(s) coordinating the nodes]

  12. Machine organization • Flat (gLite, Condor, Globus, …)

  13. Machine organization • Flat (NorduGrid, HIMAN, XtreemOS)

  14. Machine organization • Flat

  15. Machine organization • Hierarchical (UNICORE, GridWay, BOINC, …)

  16. Machine organization • Hierarchical (UNICORE, GridWay, BOINC, …)

  17. Machine organization • Interconnected (Condor (flocking), DEISA, EGEE, NorduGrid)

  18. Machine organization • Interconnected (Condor (flocking), DEISA, EGEE, NorduGrid)

  19. Proposal

  20. Machine Organization: Cell

  21. Scheduling: Cooperative • Minimize scheduling overhead using fuzzy logic [Diagram: VO1–VO5; the broker replies “go to VO2 or VO3”]
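The deck does not spell out the scheduling rule, but a minimal sketch of the idea in R (the document's example language) might look as follows. The membership function, the min-based fuzzy AND, and the load/queue inputs are all illustrative assumptions, not the presented algorithm:

    # Hypothetical fuzzy suitability score for picking a target VO.
    # 'load' and 'queue' are assumed to be normalized to [0, 1].
    low <- function(x) pmax(0, 1 - 2 * x)       # membership in "low"
    suitability <- function(load, queue) {
      pmin(low(load), low(queue))               # fuzzy AND: low load AND short queue
    }

    vos <- data.frame(vo    = c("VO1", "VO2", "VO3"),
                      load  = c(0.9, 0.3, 0.2),
                      queue = c(0.8, 0.4, 0.1))
    vos$score <- suitability(vos$load, vos$queue)
    vos$vo[which.max(vos$score)]                # forward the job to the best-scoring VO

Returning the few top-scoring VOs instead of a single one gives exactly the “go to VO2 or VO3” style of answer shown in the diagram.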

  22. Worker Failures [Timeline diagram: workers W1–W5 with checkpoints 1–8; after failure 1 and failure 2 a worker resumes from its last updated checkpoint]
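As a rough illustration of checkpoint-based recovery in R (the mechanism here, including the do_step() helper and the 10-step interval, is hypothetical, not the system's actual protocol):

    # Worker loop with periodic checkpoints; after a failure the worker
    # restarts this script and resumes from the last saved checkpoint.
    ckpt <- "checkpoint.RData"
    if (file.exists(ckpt)) {
      load(ckpt)                      # restores i0 and state
    } else {
      i0 <- 1
      state <- NULL
    }
    i <- i0
    while (i <= 100) {
      state <- do_step(state, i)      # do_step(): hypothetical unit of work
      if (i %% 10 == 0) {             # checkpoint every 10 steps
        i0 <- i + 1
        save(i0, state, file = ckpt)
      }
      i <- i + 1
    }

On a failure, only the work since the last checkpoint is recomputed, which is the “roll back to the last update” behavior the timeline depicts.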

  23. Broker Failures

  24. Broker Failures

  25. Broker Failures

  26. Simulation Model: PeerSim [Architecture diagram: a service allocator sits on top of a broker overlay; each broker runs a Broker protocol and a Grid CD protocol; each regular node runs an Allocation protocol, a Grid CD protocol, and an Idle protocol]
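PeerSim experiments are driven by a plain-text configuration file. A hedged sketch of how a stack like the one on this slide could be wired up; the example.* class names for the Allocation and Grid CD protocols are placeholders, while network.size, simulation.cycles, protocol.*, init.*, IdleProtocol, and WireKOut are standard PeerSim conventions:

    # hypothetical-peersim.cfg
    network.size 500
    simulation.cycles 1000

    protocol.lnk peersim.core.IdleProtocol       # holds the overlay links
    protocol.alloc example.AllocationProtocol    # placeholder class
    protocol.alloc.linkable lnk
    protocol.cd example.GridCDProtocol           # placeholder class
    protocol.cd.linkable lnk

    init.wire peersim.dynamics.WireKOut          # wire-k-out overlay (see slide 28)
    init.wire.protocol lnk
    init.wire.k 4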

  27. Performance Evaluation • Validity of the stored resource information. • Efficiency of service allocation. • Impact of broker failure on resource-information updating. N: total Grid size, M: number of VOs

  28. Performance Evaluation • Broker overlay topologies: hypercube, ring, wire-k-out, fully connected [Diagram: brokers 1, 2, …, K wired in each topology]
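To make two of these topologies concrete, a small R sketch (my own illustration; brokers are indexed 0..K-1, and for the hypercube K must be a power of two) that computes a broker's neighbor set:

    # Neighbors of broker i in a ring of K brokers.
    ring_neighbors <- function(i, K) {
      c((i - 1) %% K, (i + 1) %% K)              # predecessor and successor
    }

    # Neighbors of broker i in a K-node hypercube: flip each address bit.
    hypercube_neighbors <- function(i, K) {
      d <- log2(K)                               # hypercube dimension
      vapply(0:(d - 1), function(b) bitwXor(i, 2^b), numeric(1))
    }

    ring_neighbors(0, 8)        # 7 1
    hypercube_neighbors(0, 8)   # 1 2 4

Wire-k-out instead picks k random out-neighbors per broker, and fully connected links every broker pair.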

  29. Validity of the stored resource information • The deviation of the reading-time values of the RIDBs stored in a broker’s resource-information data set from the current cycle, over the simulation cycles. • The deviation value for cycle c:
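A plausible reading of this definition, assuming the deviation is the mean age of the stored reading times relative to the current cycle (the symbols S_c and t_r are my own notation; the slide does not spell the formula out):

    D(c) = \frac{1}{|S_c|} \sum_{r \in S_c} (c - t_r)

where S_c is the broker's resource-information data set at cycle c and t_r is the reading time (in cycles) stored in record r.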

  30. Validity of the stored resource information [Plots: deviation over the simulation cycles for N = 100, M = 20 and for N = 500, M = 100 (log scale)]

  31. Efficiency of Job Allocation • Periodic allocation through one broker.

  32. Impact of Broker Failures on Resource Information Updating (N = 500, M = 100) [Plot: ring topology]

  33. System Transparency

  34. Challenge • To submit jobs to a Grid system you need to learn how to: 1. Prepare your input files. 2. Write a detailed submission script. 3. Submit your jobs through the front end. 4. Monitor the execution. 5. Collect the results. Example for 2: condor_submit (see the sketch below). Do scientists have time for this?
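For step 2, a minimal Condor submit description file might look like this (the file names and arguments are placeholders, not taken from the deck):

    # job.submit - hypothetical condor_submit description file
    universe   = vanilla
    executable = worker.sh
    arguments  = input.RData
    output     = job.out
    error      = job.err
    log        = job.log
    queue

It is submitted with `condor_submit job.submit`, and that is before learning to monitor the job (e.g. with condor_q) and collect the results, which is exactly the slide's point.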

  35. Current solutions • Grid portals (web-based gateways): WebSphere, WebLogic, GridSphere, GridPortlets, … Useful for manual submission, but in many cases job submission must be performed automatically from user code.

  36. Current solutions • Web services: Birdbath (Condor), GRAM (Globus), GridSAM, … • APIs: DRMAA, SAGA, HiLA, CondorAPI, GridR, … The programming language has to support the technology, and the user must have the proper experience; this is not the case for many low-level special-purpose languages, nor for most scientists.

  37. Our Solution: GAFSI • Grid Access File System Interface: submission and management of Grid jobs are carried out by executing simple read() and write() file-system commands. This technique allows all categories of users to submit and manage Grid jobs both manually and from their code, which may be written in any language. [Demo]

  38. GAFSI-File sharing File name: Job$Cluster$R$memory1024$Condor$start [Diagram: users drop the trigger file into the <GAFSI-S watch path> via file sharing; GAFSI submits it to a Condor pool through condor_schedd, or to a UNICORE broker through UCC, and results flow back the same way]

  39. GAFSI-SSH File name: Job$Cluster$R$memory1024$Condor$start [Diagram: as in slide 38, but GAFSI-C clients on the user side transfer the trigger file to the <GAFSI-S watch path> over SFTP]
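Since the trigger file name encodes the job parameters, a small R helper keeps the convention in one place. This helper is hypothetical; it simply reproduces the $-separated field order of the example name on slides 38-39:

    # Build a GAFSI trigger file name, e.g. "Job$Cluster$R$memory1024$Condor$start".
    gafsi_name <- function(job, kind = "Cluster", lang = "R",
                           memory = 1024, framework = "Condor") {
      paste(job, kind, lang, paste0("memory", memory), framework, "start",
            sep = "$")
    }
    gafsi_name("Job")   # "Job$Cluster$R$memory1024$Condor$start"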

  40. Simple Example: R code
      1. Create the input files:
         for (j in 1:Grid.workers) {
           ...
           save(param, dataList, iterationList,
                file = paste(j, ".RData", sep = ""))
         }
      2. Copy them to the GAFSI watch path:
         for (j in 1:Grid.workers) {
           file.copy(paste(j, ".RData", sep = ""),
                     paste(Grid.workers.addresses[j], "\\input.RData", sep = ""))
         }

  41. Simple Example: R code
      3. Copy the code file to the same path:
         file.copy("worker.apl.kf.R",
                   paste(Grid.mainpath, "\\code.R", sep = ""))
      4. Create the start file to trigger the submission:
         file.create(paste(Grid.mainpath,
                           "\\mytask$cluster$R$memory300$start", sep = ""))

  42. Simple Example: R code
      5. Wait for the completion, then collect the result files:
         start.file <- paste(Grid.mainpath,
                             "\\mytask$cluster$R$exports=result.RData$memory300$start",
                             sep = "")
         while (file.exists(start.file)) {  # poll until the start file is gone
           Sys.sleep(1)
         }
         # Result collection
         for (j in 1:results) {
           load(paste(Grid.mainpath, "\\result", j, ".RData", sep = ""))
         }

  43. Initial Performance Evaluation • CPU utilization of the R process during the execution of a parallel version of the PSM.estimate() statistical-modeling function on Condor

  44. Conclusions and Future work • Maintaining stability with scalability while achieving system transparency is a considerable challenge. • We have proposed a broker-overlay-based model as an infrastructure for maintaining stability with scalability. • A Grid access file system interface is proposed to solve the transparency problem; it is currently being implemented on the Condor and UNICORE frameworks. • The proposed architecture is to be implemented on existing Grid frameworks. • GAFSI is to be implemented on Linux based on FUSE.

  45. Thank You

  46. Additional Slides

  47. Machine organization: Flat • gLite Workload Management System (WMS)

  48. Machine organization: Flat • Condor Central Manager (CM)

  49. Machine organization: Flat • Globus Grid Resource Allocation & Management (GRAM)
