Distributed Operations into the EGI era: EGEE and EGI
Vera Hansper
TERENA TF-NOC meeting, NORDUNET, 3rd May 2010
Brief History
2007
• Solely NDGF-T1 operations
2008
• Began collaboration with EGEE operations
• New Head of Operations appointed in October
• NE ROC (SNIC) merged operations with NDGF (Nov 2008)
2009
• New operations team comprising the SNIC and NDGF teams
• New operations combining NDGF ops and EGEE ops
• Update of procedures
• New ticketing system
• NOC joins to cover T1 24/7 requirements
2010
• Change in rostering style
• EGI ... new challenges
EGEE Era – The Team
NDGF
• Vera Hansper (Finnish Node Co-ord., Head of Operations)
• Jens Larsson (Swedish Node Co-ord.)
• Tore Mauset (Norwegian Node Co-ord.)
• Anders Rhod Gregersen (Danish Node Co-ord., weekend on call)
• Mattias Wadenstein (Systems Integrator, weekend on call)
SNIC
• Zeeshan Ali Shah (PDC) (only 2009)
• Thomas Bellman (NSC)
• Michaela Lechner (PDC)
• Andreas Davour (PDC) (New)
• Roger Oscarsson (HPC2N)
• Åke Sandgren (HPC2N)
What does EGEE era mean?
NDGF was set up as a distributed T1 center for LHC resources from Nordic countries
• Operations focused on those resources only
• Based on ARC middleware
NE ROC, as part of EGEE, covers resources based on gLite middleware
• Includes other Tier resources
• Covers more than just Nordic countries
Operations fell into 2 categories
• NDGF operations (OoD)
• EGEE operations (ROD)
What's done where
NDGF-T1 is a distributed T1!
NDGF Operations cover sites which are running the ARC middleware. These include
• Denmark
• Finland
• Norway
• Slovenia
• Sweden
• Switzerland
NE ROC ROD duty covers sites which fall under the NE ROC Nordics region and also run gLite. These include
• Baltic Grid
• Finland
• Norway
• Sweden
NDGF Operations from 2007
NDGF team on 5 week rotation roster
• One person per shift
Information about operations is found at https://portal.nordu.net/display/ndgfwiki/Operation
• This is constantly under development
Alerts from Nagios are sent by email to the whole team
• OoD also get SMSs
OoD must fill in a daily operations log
• Attend WLCG weekly ops meeting
Communication lines are numerous!
• Jabber
• Email
• Wiki logs – updated daily!
• Phone ....
• The occasional pigeon ...
NDGF Operations merge with SNIC
Merger occurred November 2008
Runs 8/7, with SNIC and NDGF teams alternating on a 6 week rotation roster
• One person per shift
• Weekend on call shifts handled by NDGF ops team
Information about operations can be found at https://portal.nordu.net/display/ndgfwiki/Operation
Alerts from Nagios are sent by email to both teams
• NDGF on duty operators also get SMSs
On duty operator must fill in a daily operations log (moved to the weekly meetings)
• Attend the weekly WLCG meeting
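The alternating roster above can be sketched in a few lines of Python. This is only an illustration, not the tool the teams used: the operator labels, the team sizes (three per team, so that each operator is on duty one week in six), and the function name `build_roster` are all assumptions.

```python
from itertools import cycle


def build_roster(ndgf_ops, snic_ops, weeks):
    """Assign one operator per weekly shift, alternating NDGF and SNIC.

    Returns a list of (week_index, team, operator) tuples.
    Operator names are hypothetical placeholders.
    """
    ndgf = cycle(ndgf_ops)  # round-robin within the NDGF team
    snic = cycle(snic_ops)  # round-robin within the SNIC team
    roster = []
    for week in range(weeks):
        if week % 2 == 0:
            roster.append((week, "NDGF", next(ndgf)))
        else:
            roster.append((week, "SNIC", next(snic)))
    return roster


# Hypothetical teams of three operators each: a 6 week rotation.
roster = build_roster(
    ["ndgf-a", "ndgf-b", "ndgf-c"],
    ["snic-a", "snic-b", "snic-c"],
    weeks=12,
)
for week, team, operator in roster:
    print(f"week {week:2d}: {team} {operator}")
```

With three operators per team, each operator comes up again six weeks after their previous shift. Weekend on-call shifts, which the slide assigns to the NDGF ops team, are not modelled here.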
NDGF Operations from 2009 - 24/7
24/7 operations are required by MoU with the LHC/WLCG. NORDUnet NOC agreed to cover this need.
• NOC handle after hours (17:00 CET – 09:00+1 CET) ops
• Monitor Level 1 critical services
• Receive Nagios alerts via email after hours
• Communicate directly with active responsible persons in case of emergency
  • Usually Anders, Gerd or Mattias …
• Have their own independent roster
First iteration of 24/7 began on the 9th of July 2009
• Initial coverage was from 17:00 to 22:00 CET
• Final steps to have full 24/7 started September 2009
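The after-hours split can be expressed as a small time-window check. The sketch below is an assumption-laden illustration rather than the actual alert routing: the function names `in_window` and `alert_handler` are hypothetical, times are taken to be local CET already, and weekend on-call handling is ignored.

```python
from datetime import time


def in_window(t, start, end):
    """True if t falls in [start, end), allowing windows that wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def alert_handler(t, start=time(17, 0), end=time(9, 0)):
    """Return who handles an alert at local (CET) time t.

    Defaults model the full after-hours window, 17:00 to 09:00 the next day.
    "NOC" and "OoD" are just labels for this sketch.
    """
    return "NOC" if in_window(t, start, end) else "OoD"


print(alert_handler(time(18, 30)))  # evening, inside the NOC window
print(alert_handler(time(3, 0)))    # night, window wraps past midnight
print(alert_handler(time(12, 0)))   # working hours
# Initial coverage from July 2009 ended at 22:00 rather than 09:00:
print(alert_handler(time(23, 0), end=time(22, 0)))
```

Passing `end=time(22, 0)` reproduces the initial 17:00 to 22:00 coverage, which is why the window bounds are parameters rather than constants.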
EGEE Operations
EGEE operations moved from a centrally managed system (COD) to a regionally managed model (ROD) in 2009.
• NE ROC has been in the regional model since the beginning of 2009, and NDGF has been instrumental in creating the structure of the model.
There are various layers to the EGEE operations
• Site Administrators
• 1st Line Support
• ROD
• C-COD
Monitors site availability through SAM tests
• Managed through the CIC portal
• The Regional Dashboard provides the dashboard and tools for ROD.
Extra documentation specific to EGEE
• https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalDocumentation
• CERN EDMS repository
Tools
NDGF monitoring tools
• Nagios
• Ganglia
• dCache dashboard
• FTS monitoring
• WLCG/EGEE monitoring systems
  • SAM tests
  • GRIDMAP
  • GRIDVIEW
EGEE monitoring tools
• SAM (based on a CERN-developed monitoring system)
  • Changed to Nagios-based early 2010
• GRIDMAP
• GRIDVIEW
• ROD Dashboard
  • Tools/Services linked to dashboard
A view of a tool ...
Developed by CIC at IN2P3 (France)
Integrates monitoring and database services
A view of a tool cont.
Alarms from failing tests (Nagios) can be monitored
• Sites can be contacted through the notepad
GGUS tickets can be created, monitored and solved directly
• There are time constraints for these too
Communication channels: Ticketing System
NOC operations moved their ticketing system to a JIRA-based system at the beginning of 2009
• NDGF adopted this system a short time later
Sites can subscribe to receive ticket information
• Tickets or issues can be tailored to send them to sites, or keep the issue internal
• Used by NDGF for NDGF-T1 operations
Still use the EGEE system (GGUS) in tandem
• Used mainly for EGEE operations
• NDGF-T1 also receives GGUS tickets
  • Outside users submit these
Communication: customers
Sites and Site administrators == Customers
• Site admins are encouraged to subscribe to the NDGF ticketing system
  • http://mail.ndgf.org/mailman/listinfo/ndgfticket
  • Small volume list, mainly to notify admins about central (NDGF-T1) service maintenance
• Site admins should subscribe to EGEE alarm notifications
  • https://cic.gridops.org/index.php?section=roc&page=alertnotification
  • Can be done on a site or node basis
Operators are encouraged to be proactive
• Alert sites/admins about a problem immediately – minimise ticket creation (EGEE-style)
• The faster a problem is solved, the better the overall availability of the site
  • (Important for both NDGF-T1 and EGEE level operations!)
Operations is also a support unit
Operators can issue downtime in the GOCDB for sites
The mailing list is actively read
• Admins can freely use it to communicate with the operators and admins
• Developers also read this list
Help – advice, training, etc.
• Some of the operators are cheap – you only need to mention beer!
No question is too trivial
Operations at NDGF: EGEE era
Combined daily efforts of NDGF and SNIC operations teams, out of hours handled by NOC
• NDGF/SNIC operators rostered one week in six
• NOC has its own rostering system
Lots of communication channels
• Jabber room used extensively
• NDGF and EGEE ticketing
Many monitoring tools and logging systems
• Nagios – NDGF and CERN
• Ganglia, FTS, dCache
• Wiki
• Dashboard
Attend numerous meetings
• Weekly WLCG meeting
• Daily WLCG operations meeting
• Weekly NDGF meeting
• Nordic NE ROC phone meeting