High Performance Network Monitoring Challenges for Grids
Les Cottrell
Presented at the International Symposium on Grid Computing 2006, Taiwan
www.slac.stanford.edu/grp/scs/net/talk05/iscg-06.ppt
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)

Why & Outline
• Data-intensive sciences (e.g. HEP) need to move large volumes of data worldwide
  – Requires understanding and effective use of fast networks
  – Requires continuous monitoring
• For HEP, the LHC-OPN focus is on tier 0 and tier 1 sites, i.e. just a few sites
• Outline of talk:
  – What does monitoring provide?
  – Active E2E measurements today and challenges
  – Visualization, forecasting, problem ID
  – Passive monitoring
    • Netflow
    • SNMP
  – Conclusions

Uses of Measurements
• Automated problem identification & troubleshooting:
  – Alerts for network administrators, e.g.
    • Bandwidth changes in time series, iperf, SNMP
  – Alerts for systems people
    • OS/host metrics
• Forecasts for Grid middleware, e.g. replica manager, data placement
• Engineering, planning, SLA (set & verify)
• Also (not addressed here):
  – Security: spot anomalies, intrusion detection
  – Accounting

• Several NRENs, layer 2 & 3
• Level of access is an open issue

LHC-OPN: Logical view
• The diagram to the right is a logical representation of the LHC-OPN showing monitoring hosts
• The LHC-OPN extends to just inside the T1 “edge”
• Read/query access should be guaranteed on LHC-OPN “owned” equipment
• We also request RO access to devices along the path to enable quick fault isolation
Courtesy: Shawn McKee

Active E2E Monitoring

E.g. Using Active IEPM-BW measurements
• Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. HEP tiered model
• Makes regular measurements with tools
  – Ping (RTT, connectivity), traceroute
  – pathchirp, ABwE, pathload (packet pair dispersion)
  – iperf (single & multi-stream), thrulay
  – Possibly bbftp, bbcp (file transfer applications)
    • Looking at GridFTP, but it is complex and requires renewing certificates
• Lots of analysis and visualization
• Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech to about 40 remote sites
  – http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html

IEPM-BW Measurement Topology
• 40 target hosts in 13 countries
• Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
• Traverse ~50 ASes, 15 major Internet providers
• 5 targets at PoPs, rest at end sites

Ping/traceroute
• Ping still useful (plus ça reste …)
  – Is path connected/node reachable?
  – RTT, jitter, loss
  – Great for low performance links (e.g. Digital Divide), e.g. AMP (NLANR)/PingER (SLAC)
  – Nothing to install, but subject to blocking
• OWAMP/I2 similar but One Way
  – But needs server installed at other end and good timers
  – Being built into IEPM-BW
• Traceroute
  – Needs good visualization (traceanal/SLAC)
  – Little use for dedicated λ layer 1 or 2
  – However still want to know topology of paths
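As a rough illustration of the ping metrics listed above (reachability, RTT, jitter, loss), the sketch below summarizes one batch of probe results. The function name, the input layout (RTTs in milliseconds, None for a lost probe) and the jitter definition (mean absolute difference between consecutive received RTTs) are assumptions made for this example, not the PingER or AMP definitions.

```python
from statistics import mean


def summarize_ping(rtts_ms):
    """Summarize one batch of ping probes.

    rtts_ms: list of round-trip times in milliseconds, None for a lost probe.
    Jitter is taken here as the mean absolute difference between
    consecutive received RTTs (one common convention, assumed for this sketch).
    """
    if not rtts_ms:
        raise ValueError("need at least one probe")
    received = [r for r in rtts_ms if r is not None]
    loss = 1.0 - len(received) / len(rtts_ms)
    if not received:
        # Every probe was lost: either the node is unreachable or ICMP is blocked.
        return {"reachable": False, "loss": loss,
                "rtt_min": None, "rtt_avg": None, "jitter": None}
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "reachable": True,
        "loss": loss,
        "rtt_min": min(received),
        "rtt_avg": mean(received),
        "jitter": mean(diffs) if diffs else 0.0,
    }


if __name__ == "__main__":
    print(summarize_ping([10.0, 12.0, None, 11.0]))
```

Note that a fully lost batch cannot distinguish an unreachable node from one that blocks ICMP, which is the blocking caveat on the slide.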

Packet Pair Dispersion
[Figure: packet pair crossing a bottleneck link. Labels: Bottleneck; Min spacing at bottleneck; Spacing preserved on higher speed links]
• Send packets with known separation
• See how separation changes due to bottleneck
• Can be low network intrusive, e.g. ABwE only 20 packets/direction, also fast (< 1 sec)
• From PAM paper, pathchirp more accurate than ABwE, but
  – Ten times as long (10 s vs 1 s)
  – More network traffic (~factor of 10)
• Pathload a further factor of 10 more
  – http://www.pam2005.org/PDF/34310310.pdf
• IEPM-BW now supports ABwE, Pathchirp, Pathload
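The basic packet pair idea can be sketched in a few lines: two packets sent back to back leave the bottleneck separated by the time the bottleneck needs to transmit one packet, so packet size divided by the received gap estimates the bottleneck rate. This is only the textbook relation under the assumption of no cross-traffic; it is not how ABwE, pathchirp or pathload compute available bandwidth, and the function name and the use of a median over several pairs are choices made for this example.

```python
from statistics import median


def bottleneck_capacity_bps(packet_size_bytes, arrival_gaps_s):
    """Estimate bottleneck capacity from packet pair dispersion.

    packet_size_bytes: size of each probe packet in bytes.
    arrival_gaps_s: receiver-side gaps (seconds) between the two packets
                    of each pair, one entry per pair.
    The median gap is used to reduce the influence of pairs disturbed
    by cross-traffic or timer noise (an assumption of this sketch).
    """
    gaps = [g for g in arrival_gaps_s if g > 0]
    if not gaps:
        raise ValueError("need at least one positive gap")
    return packet_size_bytes * 8 / median(gaps)


if __name__ == "__main__":
    # Hypothetical gaps of about 120 microseconds for 1500-byte packets.
    estimate = bottleneck_capacity_bps(1500, [120e-6, 125e-6, 118e-6])
    print(f"estimated bottleneck: {estimate / 1e6:.1f} Mbit/s")
```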

BUT …
• Packet pair dispersion relies on accurate timing of inter-packet separation
  – At > 1 Gbps this is getting beyond the resolution of Unix clocks
  – AND 10GE NICs are offloading function
    • Coalescing interrupts, Large Send & Receive Offload, TOE
• Need to work with TOE vendors
  – Turn off offload (Neterion supports multiple channels, can eliminate offload to get more accurate timing in host)
  – Do timing in NICs
  – No standards for interfaces
• Possibly packet trains, e.g. pathneck
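To see why timing becomes the limit, it helps to compute how small the gaps are. The sketch below works out the wire time of one packet at several link speeds and compares it with a timestamp resolution. The 1 microsecond resolution and the 1500-byte packet size are assumptions chosen for illustration, and the error figure is a rough bound rather than a measured value.

```python
ASSUMED_CLOCK_RESOLUTION_S = 1e-6  # assumption: microsecond timestamps in the host


def serialization_time_s(packet_bytes, link_bps):
    """Time to transmit one packet, i.e. the minimum back-to-back spacing."""
    return packet_bytes * 8 / link_bps


def relative_timing_error(gap_s, clock_resolution_s):
    """Rough bound on the relative error of a gap measured at this resolution."""
    return clock_resolution_s / gap_s


if __name__ == "__main__":
    for rate_bps in (1e8, 1e9, 1e10):
        gap = serialization_time_s(1500, rate_bps)
        err = relative_timing_error(gap, ASSUMED_CLOCK_RESOLUTION_S)
        print(f"{rate_bps / 1e9:g} Gbit/s: gap {gap * 1e6:.2f} us, "
              f"timing error up to about {err:.0%}")
```

At 1 Gbit/s a 1500-byte packet occupies 12 microseconds on the wire; at 10 Gbit/s it occupies 1.2 microseconds, which is comparable to the assumed clock resolution before any interrupt coalescing is taken into account.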

Achievable Throughput
• Use TCP or UDP to send as much data as possible, memory to memory, from source to destination
• Tools: iperf (bwctl/I2), netperf, thrulay (from Stas Shalunov/I2), udpmon …
• Pseudo file copy: bbcp and GridFTP also have a memory-to-memory mode

BUT …
• At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds
  – To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (52.5 GBytes at 7 Gbits/s, today’s typical performance)
• Needs scheduling to scale, even then …
• It’s not disk-to-disk or application-to-application
  – So use bbcp, bbftp, or GridFTP
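The one-minute figure follows from simple arithmetic: if slow start occupies about 6 seconds and may account for at most 10% of the test, the test must last 6 / (1 - 0.9) = 60 seconds. The sketch below reproduces that calculation and the resulting data volume. The function names are invented for this example, and the volume is an upper bound because it assumes the full rate for the whole run, including the slow start phase.

```python
def required_duration_s(slow_start_s, fraction_in_congestion_avoidance):
    """Test length so that slow start is at most (1 - fraction) of the run."""
    if not 0.0 <= fraction_in_congestion_avoidance < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return slow_start_s / (1.0 - fraction_in_congestion_avoidance)


def bytes_transferred(rate_bps, duration_s):
    """Upper bound on bytes moved, assuming the full rate for the whole run."""
    return rate_bps * duration_s / 8


if __name__ == "__main__":
    duration = required_duration_s(6.0, 0.9)
    volume = bytes_transferred(7e9, 60.0)
    print(f"duration: {duration:.0f} s, volume: {volume / 1e9:.1f} GBytes")
```

For a one-minute run at 7 Gbits/s the arithmetic gives about 52.5 GBytes, which illustrates why such tests need scheduling if many paths are to be measured.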

AND …
• For testbeds such as UltraLight, UltraScienceNet etc., the path has to be reserved
  – So the measurement infrastructure needs to add the capability to reserve the path (so needs an API to the reservation application)
  – OSCARS from ESnet is developing a web services interface (http://www.es.net/oscars/):
    • For lightweight measurements, have a “persistent” capability
    • For more intrusive measurements, must reserve just before making the measurement

Visualization & Forecasting

Examples of real data
[Figure: Caltech, thrulay, 0 to 800 Mbps, Nov05 to Mar06]
• Misconfigured windows
• New path
• Very noisy
[Figure: UToronto, miperf, 0 to 250 Mbps, Nov05 to Jan06]
• Seasonal effects
  – Daily & weekly
[Figure: UTDallas, pathchirp, thrulay and iperf, 0 to 120 Mbps, Mar-10-06 to Mar-20-06]
• Some are seasonal
• Others are not
• Events may affect multiple metrics
• Events can be caused by host or site congestion
• Few route changes result in bandwidth changes (~20%)
• Many significant events are not associated with route changes (~50%)

Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of traceroute summary table (remote hosts by hour), with samples of traceroute trees generated from the table; Los-Nettos (100 Mbps) marked]
[Figure: ABwE measurement, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am. Y axis: Mbits/s. Traces: Dynamic BW capacity (DBC); Cross-traffic (XT); Available BW = (DBC-XT). Annotations: Drop in performance (from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech); Back to original path; ESnet-LosNettos segment in the path (100 Mbits/s); Changes detected by IEPM-Iperf and ABwE]
Notes:
1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto detect and notify

Forecasting
• Over-provisioned paths should have pretty flat time series
• But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths
• Use Holt-Winters triple exponential weighted moving averages
  – Short/local term smoothing
  – Long term linear trends
  – Seasonal smoothing
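A minimal sketch of Holt-Winters triple exponential smoothing shows how the three components listed above (local level, linear trend, seasonal term) fit together. This is the standard additive form with a simple initialization, not the IEPM-BW implementation; the function name, parameter names and initialization scheme are assumptions made for this example.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, n_forecast):
    """Additive Holt-Winters forecast.

    series:     observations at a fixed interval (e.g. throughput in Mbits/s)
    season_len: number of samples in one season (e.g. samples per day)
    alpha:      smoothing factor for the local level
    beta:       smoothing factor for the linear trend
    gamma:      smoothing factor for the seasonal term
    Returns n_forecast predicted values following the end of the series.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    # Initial level: mean of the first season.
    level = sum(first) / season_len
    # Initial trend: average per-sample change between the first two seasons.
    trend = (sum(second) - sum(first)) / season_len ** 2
    # Initial seasonal terms: deviation of each sample from the first-season mean.
    seasonal = [y - level for y in first]

    for t, y in enumerate(series):
        s = seasonal[t % season_len]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % season_len]
            for h in range(1, n_forecast + 1)]


if __name__ == "__main__":
    # Hypothetical series with a repeating four-sample pattern and no trend.
    history = [10.0, 20.0, 30.0, 20.0] * 6
    print(holt_winters_additive(history, 4, 0.5, 0.25, 0.5, 4))
```

A new observation that falls well outside the forecast for its time slot is a candidate event; the threshold for that comparison is a separate tuning decision not covered by this sketch.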

Alerting
• Have false positives down to a reasonable level, so sending alerts
• Experimental
• Typically a few per week
• Currently by email to network admins
  – Adding pointers to extra information to assist the admin in further diagnosing the problem, including:
    • Traceroutes, monitoring host parameters, time series for RTT, pathchirp, thrulay etc.
• Plan to add on-demand measurements (excited about perfSONAR)

In progress
• Integrate IEPM-BW and PingER measurements with MonALISA to provide additional access
• Working to make traceanal a callable module
  – Integrating with AMP
• When comfortable with forecasting, event detection will generalize
• Looking at ARMA/ARIMA for forecasting

Passive - Netflow
