Distributed Operations into the EGI era: EGEE and EGI



  1. Distributed Operations into the EGI era: EGEE and EGI
     Vera Hansper
     TERENA TF-NOC meeting, NORDUNET, 3rd May 2010

  2. Brief History
     • 2007
       • Solely NDGF-T1 operations
     • 2008
       • Began collaboration with EGEE operations
       • New Head of Operations appointed in October
       • NE ROC (SNIC) merged operations with NDGF (Nov 2008)
     • 2009
       • New operations team comprising the SNIC and NDGF teams
       • New operations combining NDGF ops and EGEE ops
       • Update of procedures
       • New ticketing system
       • NOC joins to cover T1 24/7 requirements
     • 2010
       • Change in rostering style
       • EGI ... new challenges

  3. EGEE Era – The Team
     • NDGF
       • Vera Hansper (Finnish Node Co-ord., Head of Operations)
       • Jens Larsson (Swedish Node Co-ord.)
       • Tore Mauset (Norwegian Node Co-ord.)
       • Anders Rhod Gregersen (Danish Node Co-ord., weekend on call)
       • Mattias Wadenstein (Systems Integrator, weekend on call)
     • SNIC
       • Zeeshan Ali Shah (PDC) (only 2009)
       • Thomas Bellman (NSC)
       • Michaela Lechner (PDC)
       • Andreas Davour (PDC) (New)
       • Roger Oscarsson (HPC2N)
       • Åke Sandgren (HPC2N)

  4. What does EGEE era mean?
     • NDGF was set up as a distributed T1 center for LHC resources from Nordic countries
       • Operations focused on those resources only
       • Based on ARC middleware
     • NE ROC, as part of EGEE, covers resources based on gLite middleware
       • Includes other Tier resources
       • Covers more than just Nordic countries
     • Operations fell into 2 categories
       • NDGF operations (OoD)
       • EGEE operations (ROD)

  5. What's done where
     • NDGF-T1 is a distributed T1!
     • NDGF Operations cover sites which are running the ARC middleware. These include
       • Denmark
       • Finland
       • Norway
       • Slovenia
       • Sweden
       • Switzerland
     • NE ROC ROD duty covers sites which fall under the NE ROC Nordics region and also run gLite. These include
       • Baltic Grid
       • Finland
       • Norway
       • Sweden

  6. NDGF Operations from 2007
     • NDGF team on 5 week rotation roster
       • One person per shift
     • Information about operations is found at https://portal.nordu.net/display/ndgfwiki/Operation
       • This is constantly under development
     • Alerts from Nagios are sent by email to the whole team
       • OoD also get SMSs
     • OoD must fill in a daily operations log
       • Attend WLCG weekly ops meeting
     • Communication lines are numerous!
       • Jabber
       • Email
       • Wiki logs – updated daily!
       • Phone ....
       • The occasional pigeon ...

  7. NDGF Operations merge with SNIC
     • Merger occurred November 2008
     • Runs 8/7, with SNIC and NDGF teams alternating on a 6 week rotation roster
       • One person per shift
       • Weekend on call shifts handled by NDGF ops team
     • Information about operations can be found at https://portal.nordu.net/display/ndgfwiki/Operation
     • Alerts from Nagios are sent by email to both teams
       • NDGF on duty operators also get SMSs
     • On duty operator must fill in a daily operations log (moved to the weekly meetings)
       • Attend the weekly WLCG meeting
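
The alternating six-week rotation on this slide can be sketched in a few lines of Python. This is only an illustration, not the roster tool the teams actually used: the operator labels, the three-per-team split, and the start date are all hypothetical placeholders.

```python
from datetime import date
from itertools import chain

def build_rotation(ndgf, snic):
    """Interleave two equally sized teams so duty weeks alternate between them."""
    if len(ndgf) != len(snic):
        raise ValueError("teams must be the same size to alternate strictly")
    return list(chain.from_iterable(zip(ndgf, snic)))

def operator_for(day, rotation, start):
    """Return who is on duty in the week containing `day`, counting weeks from `start`."""
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]

# Hypothetical labels: three operators per team gives a six-week cycle.
rotation = build_rotation(["ndgf-a", "ndgf-b", "ndgf-c"],
                          ["snic-a", "snic-b", "snic-c"])
start = date(2008, 11, 3)  # assumed first day of the first duty week
```

With these placeholders, `operator_for(date(2008, 11, 10), rotation, start)` falls in the second week and returns the first SNIC label, and the cycle repeats after six weeks.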

  8. NDGF Operations from 2009 - 24/7
     • 24/7 operations are required by MoU with the LHC/WLCG.
     • NORDUnet NOC agreed to cover this need.
       • NOC handle after hours (17:00 CET – 09:00+1 CET) ops
       • Monitor Level 1 critical services
       • Receive NAGIOS alerts via email after hours
       • Communicate directly with active responsible persons in case of emergency
         • Usually Anders, Gerd or Mattias …
       • Have their own independent roster
     • First iteration of 24/7 began on the 9th of July 2009
       • Initial coverage was from 17:00 to 22:00 CET
       • Final steps to have full 24/7 started September 2009
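
The after-hours hand-over can be expressed as a small time check. The sketch below assumes the full 17:00 to 09:00 window from the slide, treats the timestamp as already being in CET, and ignores the separate weekend on-call arrangement; the function name and the "NOC"/"OoD" return labels are illustrative only.

```python
from datetime import datetime, time

# Window boundaries taken from the slide: NOC covers 17:00 until 09:00 the next day.
AFTER_HOURS_START = time(17, 0)
AFTER_HOURS_END = time(9, 0)

def on_duty(now):
    """Return which group handles an alert at local (CET) time `now`."""
    t = now.time()
    # The window wraps past midnight, so it is the union of
    # [17:00, 24:00) and [00:00, 09:00).
    if t >= AFTER_HOURS_START or t < AFTER_HOURS_END:
        return "NOC"
    return "OoD"
```

For example, `on_duty(datetime(2009, 9, 1, 18, 30))` returns `"NOC"`, while the same call at 10:00 returns `"OoD"`.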

  9. EGEE Operations
     • EGEE operations moved from a centrally managed system (COD) to a regionally managed model (ROD) in 2009.
       • NE ROC has been in the regional model since the beginning of 2009, and NDGF has been instrumental in creating the structure of the model.
     • There are various layers to the EGEE operations
       • Site Administrators
       • 1st Line Support
       • ROD
       • C-COD
     • Monitors site availability through SAM tests
       • Managed through the CIC portal
       • The Regional Dashboard provides the dashboard and tools for ROD.
     • Extra documentation specific to EGEE
       • https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalDocumentation
       • CERN EDMS repository

  10. Tools
      • NDGF monitoring tools
        • Nagios
        • Ganglia
        • dCache dashboard
        • FTS monitoring
        • WLCG/EGEE monitoring systems
        • SAM tests
        • GRIDMAP
        • GRIDVIEW
      • EGEE monitoring tools
        • SAM (based on a CERN developed monitoring system)
          • Changed to NAGIOS based early 2010
        • GRIDMAP
        • GRIDVIEW
        • ROD Dashboard
        • Tools/Services linked to dashboard

  11. A view of a tool ...
      • Developed by CIC at IN2P3 (France)
      • Integrates monitoring and database services

  12. A view of a tool cont.
      • Alarms from failing tests (NAGIOS) can be monitored
        • Sites can be contacted through the notepad
      • GGUS tickets can be created, monitored and solved directly
        • There are time constraints for these too

  13. Communication channels: Ticketing System
      • NOC operations moved their ticketing system to a JIRA based system at the beginning of 2009
        • NDGF adopted this system a short time later
      • Sites can subscribe to receive ticket information
        • Tickets or issues can be tailored to send them to sites, or keep the issue internal
        • Used by NDGF for NDGF-T1 operations
      • Still use EGEE system (GGUS) in tandem
        • Used mainly for EGEE operations
        • NDGF-T1 also receives GGUS tickets
          • Outside users submit these

  14. Communication: customers
      • Sites and Site administrators == Customers
        • Site admins are encouraged to subscribe to the NDGF ticketing system
          • http://mail.ndgf.org/mailman/listinfo/ndgfticket
          • Small volume list, mainly to notify admins about central (NDGF-T1) service maintenance
        • Site admins should subscribe to EGEE alarm notifications
          • https://cic.gridops.org/index.php?section=roc&page=alertnotification
          • Can be done on a site or node basis
      • Operators are encouraged to be proactive
        • Alert sites/admins about a problem immediately – minimise ticket creation (EGEE-style)
        • The faster a problem is solved, the better the overall availability of the site
          • (Important for both NDGF-T1 and EGEE level operations!)
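
To make the link between resolution time and availability concrete, here is a deliberately simplified calculation. It treats availability as the fraction of a period with no outage; this is an assumption for illustration and not the SAM-based availability metric used by WLCG/EGEE. The function name and the example figures are hypothetical.

```python
def availability(period_hours, outage_hours):
    """Fraction of the period during which the site was usable (simplified)."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= outage_hours <= period_hours:
        raise ValueError("outage_hours must lie within the period")
    return (period_hours - outage_hours) / period_hours

# A 30-day month is 720 hours.
slow_fix = availability(720, 36)  # problem resolved after 36 hours
fast_fix = availability(720, 2)   # same problem resolved after 2 hours
```

Under this simplified definition the 36-hour outage leaves the site at 95% for the month, while resolving it in 2 hours keeps it above 99%.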

  15. Operations is also a support unit
      • Operators can issue downtime in the GOCDB for sites
      • The mailing list is actively read
        • Admins can freely use it to communicate with the operators and admins
        • Developers also read this list
      • Help – advice, training, etc.
        • Some of the operators are cheap – you only need to mention beer!
      • No question is too trivial

  16. Operations at NDGF: EGEE era
      • Combined daily efforts of NDGF and SNIC operations teams, out of hours handled by NOC
        • NDGF/SNIC operators rostered one week in six
        • NOC has its own rostering system
      • Lots of communication channels
        • JABBER room used extensively
        • NDGF and EGEE ticketing
      • Many monitoring tools and logging systems
        • NAGIOS – NDGF and CERN
        • GANGLIA, FTS, dCACHE
        • WIKI
        • DASHBOARD
      • Attend numerous meetings
        • Weekly WLCG meeting
        • Daily WLCG operations meeting
        • Weekly NDGF meeting
        • Nordic NE ROC phone meeting
