ASGC Tier1 Center & Service Challenges Activities
ASGC, Jason Shih
Outline

- Tier1 center operations
  - Resource status, QoS and utilization
  - User support
  - Other activities in ASGC (excluding HEP)
    - Biomed DC2
- Service availability
- Service challenges
  - SC4 disk-to-disk throughput testing
- Future remarks
  - SA improvement
  - Resource expansion
ASGC T1 operations
WAN connectivity
ASGC Network
Computing resources

- Instability of the IS caused ASGC service endpoints to be removed from the experiment BDIIs
- High load on the CE impacts the published site information (the site GIIS runs on the CE)
Job execution at ASGC

- Instability of the site GIIS causes errors in publishing dynamic information (a query sketch for checking the published values follows below)
- High load on the CE leads to abnormal functioning of maui
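The published information can be checked directly against the site GIIS/BDII. The following is a minimal sketch (not from the slides) of such a check in Python using the ldap3 package; the host name and base DN are placeholders, and the conventional site-BDII port 2170 with the GLUE-1 schema is assumed.

    # Hedged sketch: verify that the site GIIS/BDII is publishing dynamic CE state.
    # Host, port and base DN are assumptions; adapt them to the actual site setup.
    from ldap3 import Server, Connection, ALL

    SITE_BDII_HOST = "site-bdii.example.org"       # placeholder host name
    SITE_BDII_PORT = 2170                          # conventional site-BDII port
    BASE_DN = "mds-vo-name=Taiwan-LCG2,o=grid"     # assumed GLUE-1 base DN

    server = Server(SITE_BDII_HOST, port=SITE_BDII_PORT, get_info=ALL)
    conn = Connection(server, auto_bind=True)      # anonymous bind, as for BDII queries

    # Pull the CE state attributes that the experiment-level BDII consumes;
    # missing or stale values are what get a site dropped from the experiment BDII.
    conn.search(BASE_DN,
                "(&(objectClass=GlueCEState)(GlueCEUniqueID=*))",
                attributes=["GlueCEUniqueID",
                            "GlueCEStateWaitingJobs",
                            "GlueCEStateRunningJobs"])
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateWaitingJobs, entry.GlueCEStateRunningJobs)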
OSG/LCG resource integration

- Mature technology helps integrate resources
  - GCB introduced to help integrate with the IPAS T2 computing resources
  - CDF/OSG users can submit jobs by gliding in through the GCB box
  - Access to T1 computing resources from the "twgrid" VO
- Customized UI to help access backend storage resources
  - Helps local users who are not yet ready for the grid
  - HEP users access T1 resources
ASGC Helpdesk

- Currently supported services (queues): CIC/ROC, PRAGMA, HPC, SRB
- Sub-queues of CIC/ROC: T1, CASTOR, SC, SSC
ASGC TRS: Accounting

  Statistic        Total / Average
  Open tickets     10
  Closed tickets   425 / 39
  Total tickets    435 / 40
Biomed DC2

- Added 90 kSI2k dedicated to DC2 activities; an additional subcluster introduced in the IS
- Keeping the site functional to help the grid jobs run
  - First run on part of the 36,690 ligands from DC2 (started 4 April 2006)
  - Troubleshooting grid-wide issues
  - Collaborating with biomed on AP operations
- AP: GOG-Singapore devoted resources for DC2; fourth run started 21 April
Biomed DC2 (cont'd)

- Two frameworks introduced: DIANE and WISDOM
- Average ~30% contribution from ASGC over the 4 runs (DIANE)

  CE / subcluster             1st    2nd   3rd    4th
  Q-HPC (quanta.grid)         15.4   28.8  8.4    12.4
  Prod. LCG (lcg00125.grid)   10.6   0     37.2   9
Service Availability
Service challenges - 4
SC4 disk-to-disk transfer

- Problems observed at ASGC:
  - System crashed immediately when the TCP buffer size was increased (2.6 kernel + XFS)
  - CASTOR experts helped with troubleshooting, but the problem remained
  - Downgraded to a 2.4 kernel with gridftp 1.2.0rh9 and XFS; again crashed once the window size was tuned
  - Problem resolved only after downgrading gridftp to the same version used for SC3
- Disk rerun (27 April, 7 AM)
  - Started with one disk server, then moved on to the remaining three
  - 120+ MB/s observed
  - Continued running for one week
Castor troubleshooting

  Kernel     Gridftp version*   XFS   Tuned$   Stable
  2.4 (1)**  1.8-13d            Y     N/A      N/A
  2.4 (1)    1.2.0rh9           Y     Y        N
  2.4 (2)+   1.8-13d            Y     N/A      N/A
  2.4 (2)+   1.2.0rh9           Y     Y        N
  2.6++      1.8-13d            Y     Y        Y
  2.6++      1.2.0rh9           Y     Y        N
  2.6        1.2.0rh9           N     Y        N
  2.6        1.1.8-13d          N     Y        Y

  *  gridftp bundled in Castor
  ** ver. 2.4.20-20.9.XFS1.3.1, introduced by SGI
  +  ver. 2.4.21-40.EL.cern, adopted from CERN
  ++ exact version 2.6.9-11.EL.XFS
  $  TCP window size tuned, max 128 MB
  Stack size recompiled to 8 KB for each experimental kernel adopted.
SC Castor throughput: GridView

- Disk-to-disk nominal rate
  - ASGC currently reaches a steady 120+ MB/s (see the check below)
  - Round-robin SRM headnodes in front of 4 disk servers, each providing ~30 MB/s
  - Debugging kernel/Castor software issues in the early part of SC4 (throughput reduced to only 25%, without further tuning)
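The nominal rate follows directly from the per-server figure, assuming the round-robin SRM headnodes load the four servers evenly:

\[
4\ \text{disk servers} \times \sim 30\ \mathrm{MB/s} \approx 120\ \mathrm{MB/s}\ \text{aggregate disk-to-disk throughput.}
\]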
Tier-1 Accounting: Jan – Mar 2006

  Tier-1        Site             alice   atlas     cms      lhcb     sum       %
  AsiaPacific   Taiwan-LCG2      0       43244     18823    0        62067     2.33
  BNL           BNL-LCG2         0       1271894   0        0        1271894   47.75
  CERN          CERN-PROD        6630    123194    258790   53626    442240    16.6
  FNAL          USCMS-FNAL-WC1   0       0         129620   0        129620    4.87
  FZK           FZK-LCG2         0       97152     51935    10147    159234    5.98
  IN2P3         IN2P3-CC         0       70349     27300    10107    107756    4.05
  INFN-T1       INFN-T1          0       0         0        0        0         0
  NorduGrid     Nordic           0       0         0        0        0         0
  PIC           pic              0       95067     64920    32371    192358    7.22
  RAL           RAL-LCG2         9031    156114    77025    21210    263380    9.89
  SARA/NIKHEF   SARA-MATRIX      783     5966      342      5744     12835     0.48
  TRIUMF        TRIUMF-LCG2      0       20489     693      818      22000     0.83
  sum                            16444   1883469   629448   134023   2663384
  %                              0.62    70.72     23.63    5.03
Accounting: VO
Overall Accounting: CMS/Atlas
CMS usage: CRAB monitoring
SRM QoS monitoring: CMS Heartbeat
Castor2@ASGC

- Testbed expected to be deployed by the end of March
- Delayed due to:
  - Obtaining the LSF license from Platform
  - DB schema troubleshooting
  - Manpower overlapping with debugging of the Castor SC throughput
  - Revised in the 2006 Q1 quarterly report
- Split into two phases; phase (I) does not consider tape
  - Functional testing
  - Plan to connect to the tape system in the next phase
- Phase (I) expected to complete by mid-May
- Phase (II) planned to finish by mid-June
Future remarks

- Resource expansion plan
- QoS improvement
- Castor2 deployment
- New tape system installed
  - Continue with disk-to-tape throughput validation
- Resource sharing with local users
  - For users who are more ready to use the grid
  - Large storage resources required
Resource expansion: MoU

  Year   T1 CPU (#)   T1 Disk (TB)   T1 Racks (#)   T1 Tape (TB)   FTT* CPU (#)   FTT* Disk (TB)
  2006   950          400            12             500            200            15
  2007   1770         900            34             800            300            30
  2008   3400         1500           61             1300           400            75
  2009   3600         2400           85             2000           -              -

  *FTT: Federated Taiwan Tier-2
Resource expansion (I)

- CPU
  - Current status: 430 kSI2k (composed of IBM HS20 and Quanta blades)
  - Goal: 950 kSI2k (a rough check of the chassis counts follows below)
    - Quanta blades: 7U chassis, 10 blades, dual CPU, ~1.4 kSI2k/CPU
      - Ratio ~30 kSI2k per 7U; 19 chassis needed (~4 racks)
    - IBM blades: LV model available (saves ~70% power consumption)
      - Higher density: 54 processors (dual-core + SMP Xeon)
      - Ratio ~80 kSI2k per 7U; only 13 chassis needed (~3 racks)
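A rough consistency check of the chassis counts, assuming the per-chassis ratios quoted above and the current 430 kSI2k staying in production:

\[
10\ \text{blades} \times 2\ \text{CPUs} \times 1.4\ \mathrm{kSI2k} \approx 28\text{--}30\ \mathrm{kSI2k}\ \text{per 7U Quanta chassis}, \qquad
19 \times 30 \approx 570\ \mathrm{kSI2k}, \qquad 430 + 570 \approx 1000 \ge 950\ \mathrm{kSI2k}.
\]

For the IBM LV blades at roughly 80 kSI2k per 7U chassis, \(13 \times 80 \approx 1040\ \mathrm{kSI2k}\), which exceeds the 950 kSI2k target on its own.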
Resource expansion (II)

- Disk
  - Current status: 3U arrays, 400 GB drives, 14 drives per array
    - Ratio: 4.4 TB/6U
  - Goal: 400 TB (see the check below)
    - ~90 arrays needed
    - ~9 racks (assuming 11 arrays per rack)
- Tape
  - New 3584 tape library to be installed in mid-May
    - 4 x LTO4 tape drives, providing ~80 MB/s throughput
  - Originally expected to be installed in mid-March; delayed by internal procurement (updating project items with the funding agency)
  - New tape system expected to be in place by mid-May; full system in operation within two weeks of installation
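The array and rack counts follow from the quoted capacity, assuming ~4.4 TB of usable space per array and 11 arrays per rack:

\[
\left\lceil \frac{400\ \mathrm{TB}}{4.4\ \mathrm{TB/array}} \right\rceil = 91 \approx 90\ \text{arrays}, \qquad
\left\lceil \frac{90}{11} \right\rceil = 9\ \text{racks}.
\]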
IBM 3584 vs. STK SL8500

  Feature                                     IBM 3584         STK SL8500
  Modular library design                      5 years          New
  Redundant robotics                          Yes              TBD
  Accessors required for redundancy           2                8
  Any-to-any cartridge-to-drive access        Yes              No
  Min/Max single library slot configuration   58 / 6,881       1,448 / 6,632
  Maximum tape drive configuration            192              64
  Maximum cartridge capacity supported        400 GB / LTO3    200 GB / 9940C
  Maximum single library capacity             2.75 PB          1.33 PB
  Cartridge density (slots / sq-ft)           41*              29
  Storage density (TB / sq-ft)                16.4*            5.7
  Audit time                                  < 60 sec/frame   < 60 min
  Average cell-to-drive time                  1.8 sec*         5 sec
  Required software expense                   None             HSC/ACSLS
  Software required for remote management     No               Yes
Resource expansion (III)

- C2 area of the new IPAS machine room
  - Rack space design
  - AC / cooling requirements:
    - 20 racks (2800 kSI2k): 1,360,000 BTUH, or 113.3 tons of cooling
    - 36 racks (1440 TB): 1,150,000 BTUH, or 95 tons
  - HVAC: ~800 kVA estimated (HS20: 4000 W x 5 x 20 + STK array: 1000 W x 11 x 36; see the check below)
  - Generator
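A rough check of the ~800 kVA figure, assuming 5 blade chassis per compute rack, 11 disk arrays per storage rack (as quoted above) and unity power factor:

\[
4000\ \mathrm{W} \times 5 \times 20 \;+\; 1000\ \mathrm{W} \times 11 \times 36 \;=\; 400\ \mathrm{kW} + 396\ \mathrm{kW} \;\approx\; 796\ \mathrm{kW} \;\approx\; 800\ \mathrm{kVA}.
\]

The compute-rack cooling load is consistent as well: \(400\ \mathrm{kW} \times 3412\ \mathrm{BTUH/kW} \approx 1{,}365{,}000\ \mathrm{BTUH} \approx 114\) tons at 12,000 BTUH per ton, close to the quoted 1,360,000 BTUH / 113.3 tons.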