Data Transfers in the Grid: Workload Analysis of Globus GridFTP
Nicolas Kourtellis, Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi (University of South Florida)
Dan Fraser (Argonne National Laboratory)

Objective 1: Quantify volume of transfers
– What is the transfer size distribution?
– What is the volume of activity for the most active hosts?
Objective 2: Understand how tuning capabilities are used
– What are the buffer sizes used during the transfers?
– What is the average bandwidth?
– What is the utilization of functionalities like streams and stripes?
Objective 3: Quantify user base and predict usage trends
– How does the user base evolve over time?
– What are the geographical characteristics of the GridFTP data transfers?

Outline
– Metrics dataset
– Surprises and …
– … zoom in (TeraGrid)
– Lessons and discussions

GridFTP Metrics Dataset
Field                         Range of Values       Comment
Source hostname/host IP       String/IPnet          Anonymized
Start time of the transfer    Timestamp             Accuracy: ms
End time of the transfer      Timestamp             Accuracy: ms
TCP Buffer Size               Integer (Bytes)       ≥ 0
Total Number of Bytes         Integer (Bytes)       ≥ 0
Number of Streams             Integer               ≥ 1
Number of Stripes             Integer               ≥ 1
Store or Retrieve             Integer (0, 1, 2)     STOR, RETR, LIST

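To make the record layout concrete, here is a minimal sketch of one metrics record as a Python dataclass. The class name, field names, and helper methods are hypothetical and chosen only for illustration; the slide does not give the actual schema, and it does not say which integer code corresponds to STOR, RETR, or LIST, so the code leaves that mapping open.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TransferRecord:
    """One GridFTP metrics record (hypothetical field names)."""
    source_host: str        # anonymized hostname or IP
    start: datetime         # start time, millisecond accuracy
    end: datetime           # end time, millisecond accuracy
    tcp_buffer_bytes: int   # >= 0
    total_bytes: int        # >= 0
    streams: int            # >= 1
    stripes: int            # >= 1
    operation: int          # 0, 1, or 2; mapping to STOR/RETR/LIST is not stated

    def duration_seconds(self) -> float:
        return (self.end - self.start).total_seconds()

    def bandwidth_bytes_per_second(self) -> Optional[float]:
        """Average bandwidth, or None when the duration is not positive."""
        duration = self.duration_seconds()
        return self.total_bytes / duration if duration > 0 else None
```

The bandwidth helper is one plausible way to derive the "average bandwidth" asked about in Objective 2 from the start time, end time, and byte count.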
Metrics Dataset
Started with ~137.5 million records (Jul ’05 - Mar ’07)
Cleaning:
– transfer size ≤ 0: ~22.8 million records
– buffer size < 0: ~1,000 records
– directory listings: ~3.9 million records
– invalid hostnames (e.g., /[B@89712e): ~4,600 records
– ANL-TeraGrid testing: ~11.4 million records
– duplicate reports: ~16.8 million records
– self transfers (source = destination): ~5.75 million records
Clean database: ~77.2 million records (~56.2%)

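The cleaning rules above can be sketched as a single filtering pass. This is an illustration under stated assumptions, not the authors' pipeline: records are plain dicts with hypothetical keys (`source`, `dest`, `bytes`, `buffer`, `op`), an "invalid hostname" is approximated as one starting with `/` (based only on the single example given), and the ANL-TeraGrid test hosts are passed in as a hypothetical set.

```python
def clean(records, test_hosts=frozenset()):
    """Apply the cleaning rules listed on the slide; return the kept records."""
    seen = set()
    kept = []
    for r in records:
        if r["bytes"] <= 0:                 # transfer size <= 0
            continue
        if r["buffer"] < 0:                 # negative buffer size
            continue
        if r["op"] == "LIST":               # directory listings
            continue
        if r["source"].startswith("/"):     # assumed test for invalid hostnames
            continue
        if r["source"] in test_hosts:       # assumed marker for ANL-TeraGrid testing
            continue
        if r["source"] == r["dest"]:        # self transfers
            continue
        key = tuple(sorted(r.items()))      # identical reports count as duplicates
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept
```

A record can match more than one rule, so the order of the checks affects the per-rule counts but not the final set of kept records.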
Surprise #1: Transfer Size Distribution
[Figure: histogram of transfer sizes. x-axis: transfer size in power-of-two bins across the Bytes, KB, and MB ranges; y-axis: % of total (0 to 10).]
Objective 1: Quantify volume of transfers

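A histogram with power-of-two size bins, like the one on this slide, could be computed roughly as follows. The exact bin edges used by the authors are not stated, so this sketch assumes each transfer falls into the bin whose upper edge is the smallest power of two at or above its size; both function names are hypothetical.

```python
from collections import Counter


def size_bin(n_bytes):
    """Upper bin edge: smallest power of two >= n_bytes (0 for empty transfers)."""
    if n_bytes <= 0:
        return 0
    return 1 << (n_bytes - 1).bit_length()


def size_histogram(sizes):
    """Map each bin edge (in bytes) to the percentage of transfers in that bin."""
    sizes = list(sizes)
    if not sizes:
        return {}
    counts = Counter(size_bin(s) for s in sizes)
    return {edge: 100.0 * n / len(sizes) for edge, n in sorted(counts.items())}
```

For example, `size_histogram([1, 2, 3, 4])` places 3 and 4 in the same 4-byte bin, which therefore holds half of the transfers.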
Zoom-in: TeraGrid
Are these results representative for production grids?
– GridFTP testing for deployment and learning
Identify transfers from TeraGrid and analyze dataset.

Transfer Size Distribution (TG)
[Figure: histogram of TeraGrid transfer sizes. x-axis: transfer size in power-of-two bins across the Bytes, KB, and MB ranges; y-axis: % of total (0 to 25).]
Objective 1: Quantify volume of transfers

[Figure: the two transfer size distributions shown together for comparison: all transfers (All; y-axis: % of total, 0 to 10) and TeraGrid transfers (TG; y-axis: % of total, 0 to 25). x-axis in both panels: transfer size in power-of-two bins across the Bytes, KB, and MB ranges.]

Why Such Small Transfers?
– There are still many old versions (i.e., before v3.9.5) of GridFTP in use. These versions do not include trace reporting capabilities.
– Other data transfer protocols and implementations are used.
– Users have turned off the reporting capability.
– Some of the logs are inevitably lost due to the UDP-based reporting mechanism.
– The low transfer volumes could suggest a shift towards data-aware job scheduling (?)

Server to Server Transfers
[Figure: bar chart of the share of # Transfers and of Volume for three categories: InterDomain, InterIP, SelfTransfers. y-axis: 0% to 80%. Data labels in the chart: 72.2%, 39.5%, 38.8%, 21.7%, 19.7%, 8.2%.]
– High reporting of Self Transfers (more than 1/3)
Objective 1: Quantify volume of transfers

Top 6 Active Hosts (all)
[Figure: Volume Transferred (TB, 0 to 250) and Number of Transfers (log scale, 1.0E+00 to 1.0E+08) for hosts 1 to 6.]
– Traffic from the top 6 hosts adds up to ~28% of total volume
– Next 48 hosts (IPs) transferred 10s of TB
Objective 1: Quantify volume of transfers

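A figure such as "the top 6 hosts account for ~28% of total volume" can be derived by summing bytes per host and comparing the largest totals against the grand total. The sketch below assumes the input is an iterable of `(host, bytes)` pairs; the function name and input shape are hypothetical.

```python
from collections import defaultdict


def top_k_share(transfers, k=6):
    """Fraction of total volume contributed by the k most active hosts."""
    volume = defaultdict(int)
    for host, n_bytes in transfers:
        volume[host] += n_bytes
    total = sum(volume.values())
    if total == 0:
        return 0.0
    top = sorted(volume.values(), reverse=True)[:k]
    return sum(top) / total
```

With per-host totals of 50, 30, and 20 bytes, `top_k_share(..., k=2)` gives 0.8, i.e. the two busiest hosts carry 80% of the volume.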
Number of Transfers & Volume (TG)
[Figure: Number of Transfers per Month (thousands, 0 to 3000) and Total Volume per Month (TB, 0 to 80). x-axis: Month-Year, Aug-05 to Mar-07.]
Objective 1: Quantify volume of transfers

Average Transfer Size & Total Volume (TG)
[Figure: Total Volume (TB, 0 to 160) and Average Transfer Size (GB, 0 to 0.8) for TeraGrid sites 1 to 8.]
Objective 1: Quantify volume of transfers

Daily Workload (TG)
[Figure: Average Volume per Day (TB, 0.00 to 1.40) and Average Number of Transfers per Day (0 to 50,000) for each day of the week, Monday to Sunday.]
– Average volume transferred per day: ~0.6 TB
– GridFTP doesn’t get weekends free!
Objective 1: Quantify volume of transfers

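One way to obtain a per-weekday average like the one plotted here is to total the bytes for each calendar date first and then average those daily totals by day of the week. The sketch assumes `(date, bytes)` pairs as input; the names are hypothetical, and dates with no reported transfers are simply absent from the average rather than counted as zero.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_volume_by_weekday(transfers):
    """Average daily volume (bytes) per weekday from (date, bytes) pairs."""
    daily = defaultdict(int)
    for day, n_bytes in transfers:
        daily[day] += n_bytes               # total volume per calendar date

    sums = defaultdict(int)
    days_seen = defaultdict(int)
    for day, total in daily.items():
        name = WEEKDAYS[day.weekday()]      # date.weekday(): Monday is 0
        sums[name] += total
        days_seen[name] += 1
    return {name: sums[name] / days_seen[name] for name in sums}
```

Comparing the Saturday and Sunday averages against the weekday ones is the kind of check behind the "no weekends free" observation.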
Monthly Workload (TG)
[Figure: Average Volume Transferred per day (TB, 0.0 to 2.5) and Average Number of Transfers per day (0 to 140,000) for each day of the month, 1 to 31.]
– ~50,000 transfers per day
– ~1 TB per day of total volume
– Lowest around 0.5 TB per day
– Peaks due to particular days
Objective 1: Quantify volume of transfers
