Making a Lion Bulletproof: SRE in Banking

Making a Lion Bulletproof: SRE in Banking Robin van Zijll & Janna Brummel (@jannabrummel) QCon NY, June 26 2019

ING is a global financial organization, active in 41 countries. This talk is about the retail bank of NL with…
• 9 million debit cards
• 8 million retail customers
• 7 million ATM transactions/month

Mobile banking is used by 4.5 million customers. Together, they log in 6 million times a day (100+ TPS).

Availability figures 2018, prime time (06:30 AM – 01:00 AM), in %
• Regulator target: 99.88
• Internet banking: uptime 99.87, downtime 0.13
• Mobile banking: uptime 99.77, downtime 0.22

Logins per second for Mobile Banking [Chart: logins per second (y-axis 0 to 140) by time of day]

Availability figures 2018, 24 hours a day, in %
• Customer expectation: 99.999
• Internet banking: uptime 99.78, downtime 0.22
• Mobile banking: uptime 99.63, downtime 0.37
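To make these percentages tangible, the following minimal sketch converts an uptime percentage into hours of downtime per year. It assumes a 365-day year and takes the prime-time window as 18.5 hours per day (06:30 AM to 01:00 AM). The function name and constants are illustrative; the slides do not say how ING computes its figures.

```python
# Convert an availability percentage into implied downtime per year.
# Assumption: a 365-day year; the slides do not state the measurement method.

HOURS_PER_DAY_FULL = 24.0
HOURS_PER_DAY_PRIME = 18.5  # 06:30 AM to 01:00 AM, the prime-time window
DAYS_PER_YEAR = 365


def downtime_hours(uptime_pct: float, hours_per_day: float) -> float:
    """Hours of downtime per year implied by an uptime percentage."""
    window_hours = hours_per_day * DAYS_PER_YEAR
    return window_hours * (100.0 - uptime_pct) / 100.0


if __name__ == "__main__":
    cases = [
        ("24h, 99.78% uptime", 99.78, HOURS_PER_DAY_FULL),
        ("24h, 99.63% uptime", 99.63, HOURS_PER_DAY_FULL),
        ("24h, 99.999% expectation", 99.999, HOURS_PER_DAY_FULL),
        ("prime time, 99.88% target", 99.88, HOURS_PER_DAY_PRIME),
    ]
    for label, pct, hours in cases:
        print(f"{label}: {downtime_hours(pct, hours):.2f} h/year")
```

Under these assumptions, 99.78% over a full year corresponds to roughly 19.3 hours of downtime, while a 99.999% expectation leaves only about 5.3 minutes.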

SR-what? Site Reliability Engineering is “what happens when you ask a software engineer to design an operations function” – Ben Treynor (Google)

People

At ING we are organized in tribes with (Biz)DevOps squads responsible for build and run. [Diagram: tribes, each with a tribe lead, product owners and squads] Our SRE team is a ‘horizontal’ squad, part of a productivity engineering tribe. We support 1700 engineers across 340 squads.

Our SRE team
• 7 engineers (4 dev, 3 ops), 2 more joining soon
• 1 product owner
• 1 chapter lead
• Mostly with engineering and on-call experience in ING product engineering

When we hire SREs, we look for someone who’s
• Passionate about reliability, problems, DevOps and open source
• OK with failure
• Insensitive to hierarchy
• Willing to teach and advise engineers about reliability
• Experienced in on-call duties and 1+ language(s) in our stack
• Still excited to work with us after meeting half our team and having heard realistic job expectations

Process

Why and how did we start with SRE?
We used to have a small team of ops engineers on call for online channels. These engineers were the ones up at night, but they could not structurally improve service reliability because of our DevOps model.
SRE pilot was started and supported:
• Team was transformed and given a new purpose
• Decided on SRE model, way of working and roadmap
• Experiences and proposal were presented to senior management
After knowledge transfer of old tasks, SRE was launched :)

For SRE, we generally see 3 organizational models:
• Service ownership is shared between PE (product engineering) and SRE
• SREs are distributed and embedded in PE teams, service ownership is shared
• Service ownership is with PE, SRE consults and creates tools (our model)

What do we do as SREs?
Service Reliability Hierarchy, from O’Reilly’s Site Reliability Engineering (2016), top to bottom: Product, Development, Capacity Planning, Testing + Release Procedures, Postmortem/RCA, Incident Response, Monitoring.
Curious to learn more about…
• Learning from failure? Check out Jason’s and Ryan’s talk
• Chaos engineering and graceful degradation? Check out Lorne’s talk
• High impact outlier system failures? Check out Laura’s talk

What do we do as SREs?
We spend 80% of our time on engineering
• We deliver the Reliability Toolkit: a white-box monitoring and alerting stack
• We work on a secure container platform with a service mesh in public cloud
We spread SRE love and best practices
• We reach out to engineers to consult and get feedback
• We educate on reliability topics
What we don’t do
• On-call for product engineering
• Work on SRE-topics already covered by other teams in our organization

We do outreach and we educate on SRE topics
We facilitate knowledge sharing
• Cross-domain SRE guild
• SRE demo sessions open to all
• Guidance via chat and intranet
• Prometheus user community
• Conference report out
We reach out to engineers
• Feedback loop for products
• We are reliability advocates
We educate engineers
• Engineering onboarding
• Prometheus workshops

When we demo, we sometimes block the hallway

We use these principles in our way of working
• We work with industry standards
• We work with open source products and practices
• We automate toil wherever and whenever we can

Technology

Why did we develop the Reliability Toolkit?
• Mean time to repair is too long – we waste time finding incident owners
• Lack of insight into application health for teams
• High level of technology diversity makes implementing monitoring difficult

How does the Reliability Toolkit work? [Diagram with these components: Applications, Prometheus, Grafana, Alert Manager, Model Builder; notifications go out via e-mail, SMS (MessageBird) and ChatOps (Mattermost)]
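The notification side of such a stack can be pictured as a routing step: an alert carries a label for its owning team, and its severity decides which channels are used. The sketch below is purely illustrative. The `team` and `severity` labels, the severity levels and the channel mapping are all assumptions, not ING's actual Alert Manager configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    team: str      # hypothetical label identifying the owning squad
    severity: str  # hypothetical values: "critical", "warning", "info"


# Hypothetical mapping from severity to notification channels.
CHANNELS_BY_SEVERITY = {
    "critical": ["sms", "chatops", "email"],
    "warning": ["chatops", "email"],
    "info": ["email"],
}


def route(alert: Alert) -> list[tuple[str, str]]:
    """Return (channel, recipient team) pairs for an alert."""
    # Unknown severities fall back to e-mail only.
    channels = CHANNELS_BY_SEVERITY.get(alert.severity, ["email"])
    return [(channel, alert.team) for channel in channels]


if __name__ == "__main__":
    print(route(Alert("HighErrorRate", "payments", "critical")))
```

Carrying the owning team on the alert itself is one way to avoid losing time finding the incident owner.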

How do we provision the Reliability Toolkit? (SRE Team)
• Together with a team we create a joint config
• We maintain and update binaries
• We deliver the Reliability Toolkit on 5 instances over 3 environments, we remain responsible
• We deliver client libraries so metrics can be scraped from servers

Before, teams would own and use a full pipeline… version control → combine configurations → build → publish → deploy = reliability toolkit (done by devops team)

…now they only own and update config: version control → combine configurations → build → deploy = reliability toolkit (one part of the pipeline is marked as done by devops team)
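The "combine configurations" step can be sketched as merging per-team config fragments into one configuration that is then built and deployed centrally. This is a minimal sketch under stated assumptions: the keys `scrape_targets` and `alert_rules`, the function `combine_configs` and the example team names are hypothetical and do not reflect the toolkit's real schema.

```python
import copy


def combine_configs(base: dict, team_configs: dict[str, dict]) -> dict:
    """Merge each team's scrape targets and alert rules into one config.

    Every merged entry is tagged with the owning team so that ownership
    stays visible after the configs are combined.
    """
    combined = copy.deepcopy(base)  # leave the shared base untouched
    combined.setdefault("scrape_targets", [])
    combined.setdefault("alert_rules", [])
    for team in sorted(team_configs):  # sorted for a reproducible result
        cfg = team_configs[team]
        for target in cfg.get("scrape_targets", []):
            combined["scrape_targets"].append({"team": team, "target": target})
        for rule in cfg.get("alert_rules", []):
            combined["alert_rules"].append({"team": team, **rule})
    return combined


if __name__ == "__main__":
    base = {"global": {"scrape_interval": "30s"}}
    teams = {
        "payments": {
            "scrape_targets": ["payments-api:8080"],
            "alert_rules": [{"name": "HighErrorRate"}],
        },
        "cards": {"scrape_targets": ["cards-api:8080"]},
    }
    print(combine_configs(base, teams))
```

In this arrangement a team only edits its own fragment under version control, while the merge, build and deploy steps stay in a single shared pipeline.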

Increasing and improving usage of Reliability Toolkit
• Include client libraries in engineering frameworks
• Ensure a good feedback loop: in person or in tooling
• Educate others during onboarding and workshops
• Template team dashboards and make other dashboards accessible to all

And now Reliability Toolkit usage has been increasing

We made onboarding and using our Reliability Toolkit easy, but our 70 onboarded teams still need to ensure that Prometheus can scrape metrics. How can we reach all 340 teams?
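For context on what "Prometheus can scrape metrics" asks of a team: the application has to expose its metrics over HTTP in the Prometheus text exposition format. The sketch below renders a counter in that format by hand, using only the standard library. The function `render_counter` and the metric name are hypothetical; in practice a client library would produce this output, and this is not the toolkit's own library.

```python
def _escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render one counter metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{_escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        metric = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric: total logins per channel.
    print(render_counter(
        "logins_total",
        "Total logins.",
        [({"channel": "mobile"}, 42), ({"channel": "web"}, 17)],
    ), end="")
```

Serving this text on an HTTP endpoint that Prometheus is configured to scrape is the part each onboarded team still has to take care of.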

Let’s try a service mesh! Curious? Check the Software Defined Infrastructure track

Why use service mesh to improve reliability?
• Service mesh helps us to get new/updated functionality to applications fast
• We can improve observability for all: metrics, logs, distributed tracing and resilience patterns based on incident learnings that work out of the box
• We can introduce/expand A/B testing, canary releasing and staged rollouts
• Engineers only need to worry about security at application level: immutable containers, zero trust network and security policies for free, taking away risk documentation work
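Canary releasing and staged rollouts come down to sending a controlled share of traffic to a new version. The sketch below shows one common way to do that deterministically, by hashing a request or user identifier into a bucket. It is an application-level illustration with hypothetical names; in a service mesh this decision is made by the proxy layer, and the talk does not describe how ING configures it.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route a request to "canary" or "stable".

    The same request_id always lands in the same bucket, so a given
    user keeps seeing the same version while the rollout share is fixed.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    # Staged rollout: raise the canary share step by step.
    ids = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50, 100):
        share = sum(pick_version(i, percent) == "canary" for i in ids) / len(ids)
        print(f"target {percent}% -> observed {share:.1%}")
```

Because buckets are fixed per identifier, raising the percentage only adds users to the canary group and never moves existing canary users back to stable.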

What are we working on next?
• Scaling in our Reliability Toolkit stack for efficient use of resources, scaling up number of teams using our stack
• Expanding our role as reliability advocates
• Completing PoC with service mesh

Takeaways
• Hire SREs from your product engineering domain
• Never compromise on mindset in SREs
• Start with a pilot if you are not sure if SRE works for you
• Pick an SRE model that works well for your organization
• Try to get senior management support and understanding
• Invest in SRE outreach and education
• Focus on scalability and ease-of-use in your tooling
• Don’t be afraid of redesign if it makes users happier

Questions? Icons used are all from flaticon.com
