asiapacific regional operation center

AsiaPacific Regional Operation Center Min-Hong Tsai ASGC ISGC May - PowerPoint PPT Presentation



  1. AsiaPacific Regional Operation Center • Min-Hong Tsai, ASGC • ISGC, May 3, Academia Sinica • http://www.twgrid.org/aproc/

  2. Agenda • Introduction • ROC Status • Recent Activities

  3. APROC Introduction • APROC Goal • Provide deployment support facilitating Grid expansion • Maximize the availability of Grid services • Supports EGEE sites in Asia Pacific since April 2005 • EGEE CIC • CIC-on-duty rotation: EGEE global operations • Monitoring tool development: GStat and GGUS Search • VO services • EGEE ROC • Monitoring, Diagnosis and Problem tracking • M/W release deployment support • Security Coordination • Site Registration • Portal and documentation

  4. ASGCCA • Production service since July 2003 • Taiwan • LCG/EGEE users in Asia Pacific without local CA • Member of both EUGridPMA and APGridPMA • http://ca.grid.sinica.edu.tw

  5. VO Infrastructure Support • APROC hosts centralized services for VOs • Host VOMS server • VO assigns manager to maintain membership • VO supplies AUP • Host LFC global file catalogue service • Resource Broker • Top-Level BDII • Currently supporting • TWGrid • APeSci

  6. EGEE Site Registration and Certification • Registration Procedure: • http://www.twgrid.org/aproc/doc/admin_intro/newrc/ • Guidance for user and host certificate registration • Registration into GOCDB • Recommend startup documentation • Instructions for further registration in • Mailing lists • VO membership • APROC ticketing system • Consulting on site architecture and deployment • Deployment support and troubleshooting • Site certification • Manual tests • SFT and GStat tests

  7. Middleware and Operations Support • Middleware Support • Installation support • New release testing • Supplementary release notes • Assist in coordination of updates and upgrades • Operations Support • Review and track GGUS and APROC tickets • Monitor and detect new problems • Provide detailed technical support to sites • Support Channels • Phone • Email • TRS Ticketing System

  8. APROC Portal • www.twgrid.org/aproc • Rollout Highlights • Supplemental documentation • Getting started links • Registration information • Contact Info and TRS links • lists.grid.sinica.edu.tw/apwiki • Supplementary release notes • Site Operations Procedures • Technical Howtos • Troubleshooting FAQs • APF and GDA meeting minutes • Feel free to contribute!

  9. Agenda • Introduction • ROC Status • Recent Activities

  10. Members and Biweekly Meeting • 11 sites, 7 countries, ~600 CPUs • Australia • India • Japan • Korea • Pakistan • Singapore • Taiwan • APF Meetings • Short biweekly meeting between AP sites • Topics • Operation: M/W issues, operations news, review site status • Service challenge: news and announcements • Other topics welcome, such as BELLE or other regional topics

  11. Site Registration • Recently: • JP-KEK-CRC-01 • In progress: • Australia-UNIMELB-LCG2 • JP-KEK-CRC-02 • TW-THU-HPC • PAKGRID3-LCG2 • Welcomed sites from CERN ROC: • INDIACMS-TIFR • NCP-LCG2 • PAKGRID-LCG2

  12. APROC Usage I • Total computing capacity is increasing • But so is utilization (peak over 80%) • [Chart: totalCPU, runJob and % CPU Usage, monthly from 4/1/2005 to 5/1/2006; axes 0 to 700 and 0 to 100%]
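
The utilization claim on this slide can be illustrated with a minimal sketch. It assumes that the chart's "% Usage" series is the ratio of running jobs (runJob) to total CPUs (totalCPU), which the slide does not state explicitly. The function name and the sample numbers are hypothetical and are not read from the chart.

```python
def cpu_utilization(running_jobs: int, total_cpus: int) -> float:
    """Return the percentage of CPUs occupied by running jobs.

    Assumption: one running job occupies one CPU.
    """
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    return 100.0 * running_jobs / total_cpus


# Hypothetical monthly snapshots: (month, total CPUs, running jobs).
samples = [("2005-04", 300, 90), ("2006-04", 600, 480)]

for month, total, running in samples:
    print(f"{month}: {cpu_utilization(running, total):.1f}% of {total} CPUs in use")
```

With these made-up numbers, capacity doubles while utilization also rises, which is the pattern the slide describes.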

  13. APROC Usage II • Jobs predominantly from Biomed, CMS and ATLAS VOs • Past year: 41 KSI2K Years • This April: 21 KSI2K Years

  14. APROC Availability I • Ideal Grid World: May 3, 2006

  15. APROC Availability II • Daily snapshots of SFT results of each site • Availability of 60-70% • Better if weighted with numbers of CPU • CT mostly replica management failure • Sensitive to Information System performance • Network Issues • Network congestion and packet loss • APROC SmokePing to monitor net performance • But monitoring from CERN is more relevant • Scheduled Downtime • Network and power maintenance • Hardware maintenance and upgrade • Middleware upgrade • [Charts: share of SFT results by status (OK, JS, JL, CT, SD), 0% to 100%, monthly from 2005-04 to 2006-04; chart labels: Decommissioned, Slow BDII, 2.4, 2.6, 2.7]
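
The remark that availability looks better when weighted by CPU count can be sketched as follows. This is only an illustration under stated assumptions: each site contributes one pass/fail SFT result per daily snapshot, and the site names, CPU counts and results below are hypothetical rather than APROC data.

```python
from typing import Dict


def availability(results: Dict[str, bool]) -> float:
    """Unweighted availability: fraction of sites whose SFT snapshot passed."""
    if not results:
        raise ValueError("no site results")
    return sum(results.values()) / len(results)


def weighted_availability(results: Dict[str, bool], cpus: Dict[str, int]) -> float:
    """CPU-weighted availability: fraction of all CPUs located at passing sites."""
    total = sum(cpus[site] for site in results)
    if total <= 0:
        raise ValueError("total CPU count must be positive")
    passing = sum(cpus[site] for site, ok in results.items() if ok)
    return passing / total


# Hypothetical daily snapshot: two large sites pass, two small sites fail.
snapshot = {"site-a": True, "site-b": True, "site-c": False, "site-d": False}
cpus = {"site-a": 400, "site-b": 100, "site-c": 50, "site-d": 50}

print(f"unweighted:   {availability(snapshot):.0%}")
print(f"CPU-weighted: {weighted_availability(snapshot, cpus):.0%}")
```

In this made-up case half of the sites fail, yet most of the CPUs sit at passing sites, so the weighted figure is higher than the unweighted one.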

  16. Support Issues • Remote troubleshooting • Email interaction is slow • Remote testing is limited • Reluctantly ask for access to services • Local diagnostic tools would be helpful • Tickets Statistic (Total/Monthly Avg) • Open tickets: 10 • Closed tickets: 425/39 • Total tickets: 435/40
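
As a quick consistency check on the ticket statistics above, the figures can be recomputed from the slide's own numbers. The derived values (closure rate, implied number of months) are not stated on the slide; they follow from simple arithmetic and assume that the monthly averages are totals divided by the number of months covered.

```python
# Figures taken from the slide (total / monthly average).
total_tickets, total_monthly_avg = 435, 40
closed_tickets, closed_monthly_avg = 425, 39

# Open tickets should equal total minus closed.
open_tickets = total_tickets - closed_tickets

# Derived values, not stated on the slide.
closure_rate = closed_tickets / total_tickets
implied_months = total_tickets / total_monthly_avg

print(f"open tickets:   {open_tickets}")
print(f"closure rate:   {closure_rate:.1%}")
print(f"implied months: {implied_months:.1f}")
```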

  17. ASGCCA Status • Improvements • Overhaul of certificate registration instructions and application forms • Step-by-step guide for browser certificate management • Addition of FAQ sections to address common tasks • In progress • Certificate import error related to Firefox 1.5 • Design and implement new RA procedures • Revise and update CP/CPS

  18. Agenda • Introduction • ROC Status • Recent Activities

  19. Security Service Challenge 1 • Purpose: to ensure that • Sufficient information is available for audit trace (for IR) • Appropriate communication channels are available • Security Challenge with (OSCT) • Sending test jobs • Sites recover evidence • DN of job submitter • IP address of submission UI • Executable name • Time when executable ran • Results • Completed March 2006 for a period of one week • Instructions and audit guide sent to participating sites • 4 of 7 APROC sites completed challenge • Some sites could not participate due to SD or unavailability • Some results were incomplete since sites did not have a Resource Broker (RB) • Sites need to contact RB admin for more information • Helpful learning exercise to familiarize security contacts with the auditing process for LCG • Improvements • Sharing of audit techniques between ROCs (GOCWiki) • Tools to extract security audit information • Helpful for future SSC to measure security patch response time
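
The four evidence items that sites were asked to recover can be modelled as a small record with a completeness check. This is a hedged sketch, not a tool used in the challenge: the class, field and function names are hypothetical, and the example values (including which items are missing) are invented for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class AuditEvidence:
    """Evidence items listed on the slide; None means the site could not recover it."""
    submitter_dn: Optional[str] = None       # DN of job submitter
    ui_ip_address: Optional[str] = None      # IP address of submission UI
    executable_name: Optional[str] = None    # Executable name
    run_time: Optional[datetime] = None      # Time when executable ran


def missing_items(evidence: AuditEvidence) -> List[str]:
    """Return the names of evidence items that were not recovered."""
    return [f.name for f in fields(evidence) if getattr(evidence, f.name) is None]


# Hypothetical partial result from a site that could only recover two items.
partial = AuditEvidence(
    submitter_dn="/C=TW/O=Example/CN=Test User",
    executable_name="ssc-test.sh",
)

print("missing:", missing_items(partial))
print("complete:", not missing_items(partial))
```

A record like this makes incomplete responses explicit, for example when a site has to ask a Resource Broker administrator for the remaining items.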

  20. Pre-Production Service • APROC started PPS service in April 2006 • Previously managed by Application team • PPS deployment with glite-3.0 RC2 complete • Mix of LCG and gLite components • LCG-CE • gLite-CE • MON • combined UI • Integration of production SE and SRM services • FTS still needs to be deployed • Summary • Good way to get experience with gLite middleware • Using YAIM is a very good transition for ROC staff • LCG components are more stable than gLite counterparts • Required significant support from CERN for gLite-CE • Integration with lcg-CE batch system was not trivial • Still troubleshooting • Need significant time to relearn administration and troubleshooting techniques • Administration documentation like that accumulated for LCG in GOCWiki would be helpful

  21. Grid Administrator Tutorial I • Goal and details • Educate and train EGEE Site Administrators • Two-day tutorial with instruction in Chinese • Hosted at Academia Sinica in March 2006 • Topics covered • Grid technology and components • Operations, administration and troubleshooting • Brief overview of Grid applications • Hands-on session to deploy functional sites • 36 Xen servers configured • Simple CA, RB, BDII, VOMS, LFC provided • 5 teams of 6 participants deployed sites (UI, MON, CE, WN, DPM-head, DPM-disk) • Based on Marco La Rosa’s KEK tutorial • http://lists.grid.sinica.edu.tw/apwiki/Grid_Administrator_Tutorial_Hands-on_Instructions

  22. Grid Administrator Tutorial II • Results • 30 participants from 15 institutes • 4.18/5.0 survey evaluation scores • Only a couple of teams were able to complete a fully functional site • Not enough time • Set up YAIM configuration from scratch • Time consuming and error prone • More realistic and gives participants a chance to troubleshoot • Feedback • Break up hands-on session to practice after each lecture • Provide a reference cheat sheet • Acronyms • Grid architecture diagrams • Suggest Linux training material as prerequisite • Provide user and developer tutorials • Significant time to set up hands-on session servers for installation • Is this available in GLIDA?

  23. GStat Development • Instances created for • PPS Service • Regional projects • BalticGrid, EELA, EUChinaGrid, etc. • Usage calculations modified • PhysicalCPU • SizeTotal, SizeFree • Results published • To Service Availability Monitoring Environment (SAME) at CERN • Client tool to retrieve historical data • http://goc.grid.sinica.edu.tw/gocwiki/GStat_Client_Tools
