How we un-scattered our DNS setup and unlocked new automation options


  1. How we un-scattered our DNS setup and unlocked new automation options Dan Lüdtke Technical Lead SRE @ eGym GmbH

  2. ● Make the gym work for everyone! ● Digital strength machines ● "Fitness Cloud" ○ Unify training data across vendors ● Data Analysis ● Apps ● Research Projects ○ Improve diabetes patients' symptoms through a special training program

  3. A year ago...

  4. #startuplife (do first, ask later) [Diagram: host name foo.tu.ts.egym.com annotated with the labels "artifact", "team", and "team space"; further labels: "~200 Domains", ">30" and "5" for "Name Servers" and "Registrars", "Profit!"]

  5. Issues ● Ran into maximum Managed Zone limit on Google Cloud DNS ● Horrible lookups! ○ Slowing down customers ○ Hard to debug ● Deployment Strategy #YOLO ● "Haunted Graveyard" ○ Only few were allowed to touch DNS ○ Even fewer dared to touch DNS [Diagram: lookup chain from the TLD through name servers A, B, and C, linked by NS delegations. On egym.de, x.egym.de is a CNAME to x.co.ts.egym.com; co.ts.egym.com is delegated via NS; on co.ts.egym.com, x.co.ts.egym.com is a CNAME to elb-123.aws.com.]

  6. Lessons Learned Organizational structure and infrastructure evolve differently. Don't force one onto the other. Use company-wide unique artifact names in DNS.

  7. Let's Improve!

  8. What is the Problem here? ● SREs ○ "One does not simply change DNS" ○ "How to rollback?" ○ "Web interface does not provide atomicity!" ● Devs ○ "Agility!" ○ "We build it, we run it!" ○ "SRE is too slow changing DNS"

  9. Divide and Conquer DNS Data ● Volatile (Agility) ○ Special test domain ○ No availability guarantees ○ Everyone can change directly ○ No reviews ○ No tests ○ No atomicity (no changesets) ● Production (Reliability) ○ Version control ○ Reviewed changes ○ Tested for common mistakes ○ Tested for syntax, logic, deployment feasibility ○ Atomic deployment of whole changeset

  10. Do we really have competing goals? ● SREs: "We need reviewed, version-controlled changes in production." ● Devs: "We need rapid change during development."

  11. Storing DNS Data

  12. Zone Data ● Version Control ○ Git repository ○ All developers have access ● YAML-based format ○ Developers love it ■ compared to zone files ;) ○ Easy to read and understand ● Templating functionality

      coffee.egym.zone.yml:

          zones:
            - zone: egym.coffee
              description: Test zone.
              ttl: 300
              templates:
                - gmail
                - website
              names:
                - name: '@'
                  texts:
                    data:
                      - foobar-site-verification-123456
                - name: paloalto
                  forwarding:
                    ttl: 60
                    target: flaky.cloud.example.com.
                - name: losangeles
                  addresses:
                    literals:
                      - 192.0.2.99
                      - 2001:db8:200::99

  13. Zone Data (Template) ● Tradeoff between ○ Principle of Least Surprise ○ Don't Repeat Yourself (DRY) ● Typical templates ○ Set of mail servers ○ Set of name servers (delegation) ○ Domain Parking ○ Redirect to commercial website

      gmail.template.yml:

          templates:
            - template: gmail
              description: >
                This template adds Google
                mail servers to a zone.
              names:
                - name: '@'
                  mail:
                    ttl: 604800
                    mailservers:
                      - mailserver: aspmx.l.google.com.
                        priority: 10
                      - mailserver: alt1.aspmx.l.google.com.
                        priority: 20
                - name: google._domainkey
                  texts:
                    data:
                      - >
                        v=DKIM1; k=rsa; p=foobar123456
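
The templating idea on slides 12 and 13 can be sketched in Go as follows. This is a minimal illustration, not the dns-tools implementation: the `Name`, `Template`, and `Zone` types and the `Expand` function are hypothetical, heavily simplified stand-ins for the YAML structures, and the merge order (template names first, then the zone's own names) is an assumption.

```go
package main

import "fmt"

// Name is a hypothetical, simplified stand-in for one entry under "names:".
type Name struct {
	Name  string
	Texts []string // TXT data
	Mail  []string // mail server host names
}

// Template mirrors a "templates:" entry: a reusable set of names.
type Template struct {
	Template string
	Names    []Name
}

// Zone mirrors a "zones:" entry that references templates by name.
type Zone struct {
	Zone      string
	Templates []string
	Names     []Name
}

// Expand returns the names of all referenced templates followed by the
// zone's own names. Referencing an unknown template is an error.
func Expand(z Zone, available map[string]Template) ([]Name, error) {
	var out []Name
	for _, ref := range z.Templates {
		t, ok := available[ref]
		if !ok {
			return nil, fmt.Errorf("zone %s: unknown template %q", z.Zone, ref)
		}
		out = append(out, t.Names...)
	}
	return append(out, z.Names...), nil
}

func main() {
	gmail := Template{Template: "gmail", Names: []Name{
		{Name: "@", Mail: []string{"aspmx.l.google.com.", "alt1.aspmx.l.google.com."}},
	}}
	zone := Zone{Zone: "egym.coffee", Templates: []string{"gmail"}, Names: []Name{
		{Name: "losangeles"},
	}}
	names, err := Expand(zone, map[string]Template{"gmail": gmail})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, n := range names {
		fmt.Println(n.Name)
	}
}
```

Failing on an unknown template name, rather than silently skipping it, matches the deck's goal of catching common mistakes before deployment.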

  14. Validating DNS Data

  15. Resource Record Database (RRDB) ● Go package ● Limited dependencies ○ Go Standard Library ○ YAMLv2 ● High test coverage ● Unfortunately: Battle-tested

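  16. RRDB Internals: Trie Data Structure [Diagram: a trie keyed by DNS labels, starting at the root node ".". The example name my-service.egym.com is stored along the path root → com → egym → my-service; the root has further children such as de, it, pl, ...; nodes hold record sets such as A, AAAA, MX, TXT.]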
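
A label-keyed trie like the one on this slide can be sketched as below. The type and method names (`Node`, `Insert`, `Lookup`, `FQDN`) are illustrative assumptions and not the actual RRDB API; the parent pointer anticipates the back-and-forth traversal mentioned on the next slide.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is one DNS label in the tree (hypothetical type, not RRDB's own).
type Node struct {
	Label    string
	Parent   *Node
	Children map[string]*Node
	Records  map[string][]string // record type -> values
}

// NewRoot creates the root node ".".
func NewRoot() *Node {
	return &Node{Children: map[string]*Node{}, Records: map[string][]string{}}
}

// labels splits a name into labels ordered from the root down,
// e.g. "my-service.egym.com." becomes ["com", "egym", "my-service"].
func labels(fqdn string) []string {
	fqdn = strings.ToLower(strings.TrimSuffix(fqdn, "."))
	if fqdn == "" {
		return nil
	}
	parts := strings.Split(fqdn, ".")
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return parts
}

// Insert walks down from the root, creating nodes as needed, and adds a record.
func (root *Node) Insert(fqdn, rrtype, value string) *Node {
	n := root
	for _, l := range labels(fqdn) {
		child, ok := n.Children[l]
		if !ok {
			child = &Node{Label: l, Parent: n, Children: map[string]*Node{}, Records: map[string][]string{}}
			n.Children[l] = child
		}
		n = child
	}
	n.Records[rrtype] = append(n.Records[rrtype], value)
	return n
}

// Lookup returns the node for a name, or nil if it does not exist.
func (root *Node) Lookup(fqdn string) *Node {
	n := root
	for _, l := range labels(fqdn) {
		if n = n.Children[l]; n == nil {
			return nil
		}
	}
	return n
}

// FQDN rebuilds the full name by following parent pointers up to the root.
func (n *Node) FQDN() string {
	if n.Parent == nil {
		return "."
	}
	var sb strings.Builder
	for ; n.Parent != nil; n = n.Parent {
		sb.WriteString(n.Label + ".")
	}
	return sb.String()
}

func main() {
	root := NewRoot()
	root.Insert("my-service.egym.com.", "A", "192.0.2.10")
	root.Insert("my-service.egym.com.", "AAAA", "2001:db8::10")
	n := root.Lookup("my-service.egym.com.")
	fmt.Println(n.FQDN(), n.Records["A"], n.Records["AAAA"])
}
```

Storing labels in reverse order (TLD first) means all names below a domain share one subtree, which makes walking and cross-node checks straightforward.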

  17. RRDB Internals: Today's Features ● Logic checks within nodes ○ E.g. CNAME and most other record types are mutually exclusive ● Back-and-forth traversal ○ Parent pointers ● Logic checks across nodes ○ E.g. Node with NS records should not have children ● Walk and query the Trie ● Idea: Inheritance of certain values (e.g. TTL)
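
The two kinds of checks named on this slide can be sketched as follows. This is an assumption-laden illustration, not RRDB code: `Node`, `checkNode`, and `checkTree` are hypothetical names, the CNAME rule is simplified to "no other record type at all" (the slide says "most"), and exempting the zone apex from the NS rule is an assumption added here.

```go
package main

import "fmt"

// Node is a hypothetical, minimal tree node for demonstrating the checks.
type Node struct {
	Name     string              // full name, used in messages
	Records  map[string][]string // record type -> values
	Children []*Node
}

// checkNode runs logic checks within a single node.
// Simplification: CNAME is treated as exclusive with every other type.
func checkNode(n *Node) []string {
	var problems []string
	if _, hasCNAME := n.Records["CNAME"]; hasCNAME && len(n.Records) > 1 {
		problems = append(problems, n.Name+": CNAME must not coexist with other record types")
	}
	return problems
}

// checkTree walks the tree and adds a check across nodes: a node with
// NS records (a delegation) should not have children. The apex flag
// exempts the zone's own top node (assumption, not from the slides).
func checkTree(n *Node, apex bool) []string {
	problems := checkNode(n)
	if _, hasNS := n.Records["NS"]; hasNS && !apex && len(n.Children) > 0 {
		problems = append(problems, n.Name+": node with NS records should not have children")
	}
	for _, c := range n.Children {
		problems = append(problems, checkTree(c, false)...)
	}
	return problems
}

func main() {
	shadowed := &Node{Name: "foobar.old.egym.com.", Records: map[string][]string{"AAAA": {"2001:db8::1"}}}
	delegated := &Node{
		Name:     "old.egym.com.",
		Records:  map[string][]string{"NS": {"ns.example.net."}},
		Children: []*Node{shadowed},
	}
	conflict := &Node{Name: "www.egym.com.", Records: map[string][]string{
		"CNAME": {"x.example.com."},
		"A":     {"192.0.2.1"},
	}}
	apex := &Node{Name: "egym.com.", Records: map[string][]string{}, Children: []*Node{delegated, conflict}}

	for _, p := range checkTree(apex, true) {
		fmt.Println(p)
	}
}
```

The NS check targets records that sit below a delegation point: resolvers follow the delegation, so such records are never served from this zone.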

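  18. RRDB Internals: Past Disasters [Diagram: two views of foobar.egym.com in the trie (root → com → egym → foobar, with sibling TLDs de, it, pl). "What we believed to be serving": the AAAA record at foobar. "What we actually served": an NS record on the path delegated to an old, end-of-life DNS server, which answered with its own AAAA record for foobar.]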

  19. New Process

  20. New Deployment Workflow: Push Commit

  21. New Deployment Workflow: Push Commit → YAML Lint

  22. New Deployment Workflow: Push Commit → YAML Lint → RRDB Logic Checks

  23. New Deployment Workflow: Push Commit → YAML Lint → RRDB Logic Checks → Deploy to DNS Staging

  24. New Deployment Workflow: Push Commit → YAML Lint → RRDB Logic Checks → Deploy to DNS Staging → Review

  25. New Deployment Workflow: Push Commit → YAML Lint → RRDB Logic Checks → Deploy to DNS Staging → Review → Deploy to DNS Production
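
The automated part of this workflow is a sequence of gates in which a failure stops everything that follows. A rough Go sketch of that control flow, with hypothetical names (`Stage`, `RunPipeline`, `ErrLogicCheck`) and stub stages standing in for the real lint, check, and deploy steps:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLogicCheck is a placeholder failure used by the demo stage below.
var ErrLogicCheck = errors.New("logic check failed")

// Stage is one step of the pipeline (hypothetical type).
type Stage struct {
	Name string
	Run  func() error
}

// RunPipeline executes the stages in order and stops at the first failure.
// It returns the names of the stages that completed successfully.
func RunPipeline(stages []Stage) ([]string, error) {
	var done []string
	for _, s := range stages {
		if err := s.Run(); err != nil {
			return done, fmt.Errorf("stage %q failed: %w", s.Name, err)
		}
		done = append(done, s.Name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	stages := []Stage{
		{Name: "YAML lint", Run: ok},
		{Name: "RRDB logic checks", Run: func() error { return ErrLogicCheck }},
		{Name: "deploy to DNS staging", Run: ok},
	}
	done, err := RunPipeline(stages)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

In this sketch a broken changeset never reaches the staging deployment, so a reviewer only sees changes that already passed the automated gates.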

  26. Benefits of New Process ● DNS workflow and moving parts are out-of-band ○ Code and Pipeline on Bitbucket ○ Independent from the records we serve ● Pipeline run takes ~1.5 minutes ○ Before: review took hours or days ○ Including all checks ○ Including full staging deployment

  27. Lessons Learned Automated checks lower the entry barrier and empower developers. Democratize critical infrastructure! De-haunt the graveyards!

  28. Battle-tested Existing Tools ● Record Store (Shopify) ○ No Cloud DNS support (added Jan '18) ○ We were just moving away from Ruby within SRE ● OctoDNS (GitHub) ○ No Cloud DNS support (added Oct '17) ● Denominator (Netflix) ○ No Cloud DNS support ● DNSControl (Stack Exchange) ○ Go ○ Uses Domain Specific Language ○ We did not know about it

  29. Lesson Learned We may have fallen for Not-Invented-Here...? Do proper research!

  30. Use our tools if all of the following apply ● You love YAML ● You need a Go library (RRDB) ● Google Cloud DNS is your only DNS provider ● You need to walk & query the final dataset ○ Custom checks ○ Service Discovery ○ Special Needs ● Prefer a small binary ○ that fits into out-of-band pipelines

  31. Achievements Unlocked ● DNS is finally out-of-band ● DNS is not scary anymore! ○ Spreads the review load from SRE to everyone ● Certificate Automation in Kubernetes ○ Cluster Issuer uses DNS-01 challenge ■ works for client certificate protected hostnames ○ Developers can request valid Let's Encrypt certificates via Certificate Resource ■ even before DNS is pointed to the corresponding Ingress Resource ● Configuration-less Delegation Monitoring ○ Automatically monitors all domains that appear on Cloud DNS ○ Alert on domain take-over ○ Alert on delegation errors
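
The slides do not describe how the delegation monitoring works. One plausible core, sketched here purely as an assumption, is to compare the name servers a zone is expected to use (for example, those assigned by Cloud DNS) with the name servers actually delegated (for example, obtained via Go's `net.LookupNS`). `DelegationDiff` and its helpers are hypothetical names, and the example host names are placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize lower-cases names, strips the trailing dot, removes
// duplicates, and returns the result sorted.
func normalize(servers []string) []string {
	set := map[string]bool{}
	for _, s := range servers {
		set[strings.ToLower(strings.TrimSuffix(s, "."))] = true
	}
	out := make([]string, 0, len(set))
	for s := range set {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

// contains reports whether s is present in a sorted slice.
func contains(sorted []string, s string) bool {
	i := sort.SearchStrings(sorted, s)
	return i < len(sorted) && sorted[i] == s
}

// DelegationDiff compares expected and delegated name servers.
// missing: expected but not delegated (possible delegation error).
// unexpected: delegated but not expected (possible take-over).
func DelegationDiff(expected, delegated []string) (missing, unexpected []string) {
	want, have := normalize(expected), normalize(delegated)
	for _, s := range want {
		if !contains(have, s) {
			missing = append(missing, s)
		}
	}
	for _, s := range have {
		if !contains(want, s) {
			unexpected = append(unexpected, s)
		}
	}
	return missing, unexpected
}

func main() {
	expected := []string{"ns1.dns.example.", "ns2.dns.example."}
	delegated := []string{"NS1.dns.example", "ns.elsewhere.example."}
	missing, unexpected := DelegationDiff(expected, delegated)
	fmt.Println("missing:", missing)
	fmt.Println("unexpected:", unexpected)
}
```

A non-empty `unexpected` list would map to a take-over alert and a non-empty `missing` list to a delegation-error alert, if the monitoring were built along these lines.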

  32. ● Open Source dns-tools and RRDB: https://bitbucket.org/egym-com/dns-tools/ ● Full story of our DNS journey in our tech blog: https://code.egym.de/ ● Join Munich SRE Meetup! ● Fitness and engineering careers: egym.com ● Mostly non-political, tech-related (re-)tweets: @danrl_com ● I blog about SRE and technology: https://danrl.com
