Taskerman A Distributed Cluster Task Manager Raghavendra D Prabhu - PowerPoint PPT Presentation

Taskerman A Distributed Cluster Task Manager Raghavendra D Prabhu rprabhu@yelp.com @randomsurfer Distributed Systems

Yelp’s Mission Connecting people with great local businesses.

Datastore Ecosystem @

Cassandra Elasticsearch Zookeeper PostgreSQL

And many more… ● Memcached ● Redis ● Spark ● Redshift ● DynamoDB ● PaaStorm ● S3

Distributed Systems ● Several TB in Cassandra clusters with tens of nodes each ● Close to a million messages/second in streaming pipeline ● Several TB in Elasticsearch with several hundred nodes in each ● Many PB archived to S3 every month ● Multi-AZ Multi-Region ● And growing…

“Need to run logical backup on a fleet without disruption to ingress traffic” “Run anti-entropy repair on Cassandra cluster without spiking read latency” “Reboot 1000 instances without taking a millennium, but without bringing down the site either” “Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”

Pet vs Cattle

Maintenance Cost Engineering Efficiency Scalability

Taskerman

Requirements ● Safe ● Security ● Generic and Extensible ● Distributed ● Loosely coupled ● Cluster awareness

Desirable ● Schedulable ● Reusable ● Auditability ○ Not Ad-hoc ○ More Declarative, Less Imperative ○ Config Management ● Maintainability ● Observability ● Resilience

Safety ● Paramount* ● Serialized execution ○ ‘m’ out of ‘n’ ○ Disjoint jobs. ● Avoid cascade ● Privilege escalation ● Pull-based * Unless oncall is automated too.
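The ‘m’ out of ‘n’ rule above can be sketched as a small selection function: given the nodes of a cluster and the set already being worked on, only hand out as many new nodes as the limit allows. This is an illustrative sketch, not Taskerman's code; `pick_batch`, the node names, and the sorted ordering are all assumptions.

```python
from typing import Iterable, List, Set


def pick_batch(nodes: Iterable[str], in_progress: Set[str], limit: int) -> List[str]:
    """Return the nodes that may start now so that at most `limit` run at once.

    Hypothetical helper: `limit` plays the role of 'm', len(nodes) the role of 'n'.
    """
    free_slots = max(0, limit - len(in_progress))
    # Sort for a deterministic order; a real system may order differently.
    candidates = [n for n in sorted(nodes) if n not in in_progress]
    return candidates[:free_slots]


nodes = ["node-a", "node-b", "node-c", "node-d"]
first = pick_batch(nodes, in_progress=set(), limit=2)
second = pick_batch(nodes, in_progress={"node-a"}, limit=2)
```

With a limit of 2, the first call hands out two nodes; while `node-a` is still busy, the second call hands out only one more.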

Fallacies of Distributed Systems ● Network is reliable ● Latency is zero ● Bandwidth is infinite ● Network is secure ● One administrator ● Transport cost is zero ● Network is homogeneous ● Topology doesn't change

Quotes There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery @mathiasverraes There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. @secretGeek

Building Blocks ● Scheduler ● Router ● Co-ordinator ● Transport ● Executor ● Error handler ● Configuration ● Monitoring ● Tooling

[Figure: Flow of a task. Labels: Task Scheduler, Workqueue, Queue Router, Node Queues (Q1, Q2, Q3), Cluster (T1, T2, T3), EC2 API, Zookeeper, Lease, Retries, Failure, Dead Letter Queue]

# Anatomy of a Taskerman Task
# Restart action for 2 nodes of geo_counter
# cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': 'abcd-ef123',
    'taskerman_params': {
        'action_args': {'force': true},        # force=true for restart
        'workqueue_args': {'retry_count': 3},  # retry_count for queue
    },
    'nodes': [],      # [a,b,c,d] to skip discovery
    'destnode': '',
}
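A task like the one above is plain data, so it can be checked before it is enqueued. The sketch below assumes a set of required keys and a `module:action` naming rule inferred from the example; neither is confirmed by the slides, and `validate_task` is a hypothetical helper.

```python
# Assumption: these keys are mandatory. The slides do not say which ones are.
REQUIRED_KEYS = {"action", "cluster_name", "owner", "task_id"}


def validate_task(task: dict) -> list:
    """Return a list of human-readable problems; an empty list means the task looks sane."""
    problems = []
    missing = sorted(REQUIRED_KEYS - task.keys())
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    limit = task.get("limit", 1)
    if not isinstance(limit, int) or limit < 1:
        problems.append("limit must be a positive integer")
    # Assumed convention, taken from 'cassandra_task:restart' in the example.
    if task.get("action", "").count(":") != 1:
        problems.append("action must look like 'module:action'")
    return problems


slide_task = {
    "action": "cassandra_task:restart",
    "version": 1.2,
    "limit": 2,
    "cluster_name": "cassandra:geo_counter",
    "discovery": "aws_tags",
    "owner": "gsi",
    "task_id": "abcd-ef123",
    "taskerman_params": {
        "action_args": {"force": True},
        "workqueue_args": {"retry_count": 3},
    },
    "nodes": [],
    "destnode": "",
}
problems = validate_task(slide_task)
```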

Task Scheduler ● Runs on Chronos ● Emits a task ● Enqueues into global queue ● Ad-hoc invocation ● Deployment granularities ● Task tracking ● Yelpsoa-configs PaaSTA

WorkQueue ● AWS SQS ● Best-effort FIFO ● Reliable and cheap ● Low latency ● Properties ○ Read without delete ○ Visibility timeout ○ Retry ○ Dead Letter Queue
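The queue properties listed above (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a toy in-memory model. This is not the SQS API and simplifies its semantics; `ToyQueue`, `max_receives`, and the explicit `now` argument are assumptions made to keep the example deterministic.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class ToyQueue:
    """In-memory stand-in for a work queue. Messages are assumed to be unique strings."""

    visibility_timeout: float
    max_receives: int
    _visible: Deque[Tuple[str, int]] = field(default_factory=deque)  # (message, receive count)
    _in_flight: Dict[str, Tuple[float, int]] = field(default_factory=dict)  # message -> (deadline, count)
    dead_letters: List[str] = field(default_factory=list)

    def send(self, message: str) -> None:
        self._visible.append((message, 0))

    def _expire(self, now: float) -> None:
        # Messages whose timeout lapsed without a delete become visible again,
        # or move to the dead letter list once they have used up their receives.
        for message, (deadline, count) in list(self._in_flight.items()):
            if now >= deadline:
                del self._in_flight[message]
                if count >= self.max_receives:
                    self.dead_letters.append(message)
                else:
                    self._visible.append((message, count))

    def receive(self, now: float) -> Optional[str]:
        """Read without delete: the message is hidden, not removed."""
        self._expire(now)
        if not self._visible:
            return None
        message, count = self._visible.popleft()
        self._in_flight[message] = (now + self.visibility_timeout, count + 1)
        return message

    def delete(self, message: str) -> None:
        """Called by the consumer only after the task succeeded."""
        self._in_flight.pop(message, None)


q = ToyQueue(visibility_timeout=30.0, max_receives=2)
q.send("restart node-a")
first = q.receive(now=0.0)    # delivered, now hidden for 30 seconds
hidden = q.receive(now=10.0)  # still hidden, nothing to deliver
second = q.receive(now=30.0)  # never deleted, so it is retried
third = q.receive(now=60.0)   # retries exhausted, message is dead-lettered
```

A consumer that crashes mid-task never calls `delete`, so the message reappears after the timeout; that is what makes the retry automatic.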

Task Router ● Stateless Marathon worker ● Routes tasks to clusters ● Custom routing logic ● At-least-once delivery ● ‘DNS’ of Taskerman ● Pluggable discovery ○ AWS PaaSTA ○ Smartstack
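Pluggable discovery can be sketched as a registry of functions that map a cluster name to its nodes, with the router fanning a task out to per-node queues. The fan-out of one copy per node, the `static` backend, and the addresses are assumptions for illustration; enforcement of `limit` is left out.

```python
from typing import Callable, Dict, List

# A discovery backend maps a cluster name to a list of node identifiers.
Discovery = Callable[[str], List[str]]


def static_discovery(cluster_name: str) -> List[str]:
    """Hypothetical backend with a hard-coded inventory, standing in for AWS tags or Smartstack."""
    inventory = {"cassandra:geo_counter": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
    return inventory.get(cluster_name, [])


DISCOVERY_BACKENDS: Dict[str, Discovery] = {"static": static_discovery}


def route(task: dict, node_queues: Dict[str, List[dict]]) -> List[str]:
    """Resolve the task's target nodes and append one copy of the task per node queue."""
    # An explicit 'nodes' list skips discovery, as noted on the task anatomy slide.
    nodes = task.get("nodes") or DISCOVERY_BACKENDS[task["discovery"]](task["cluster_name"])
    for node in nodes:
        node_queues.setdefault(node, []).append(dict(task, destnode=node))
    return nodes


queues: Dict[str, List[dict]] = {}
routed = route(
    {
        "action": "cassandra_task:restart",
        "cluster_name": "cassandra:geo_counter",
        "discovery": "static",
        "nodes": [],
    },
    queues,
)
```

Because the router holds no state of its own, any number of copies can run side by side; adding a new discovery mechanism is one more entry in `DISCOVERY_BACKENDS`.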

TaskRunner ● The executor of Taskerman ● Dequeues a task and executes it ○ Pre-defined reviewed code. ● Cron-ed on node ● Zookeeper for coordination ● Task deleted upon success ● Dead letter queue upon failed retries

class TestTaskRunner(TaskRunner):
    def __init__(self, task, *args, **kwargs):  # further parameters are elided on the slide
        """State management and datastore-specific setup."""

    def pre_check(self):
        """Is the task safe to execute on this cluster?"""

    def execute_action(self):
        """Actual execution of task:action."""

    def post_check(self):
        """Is the cluster good after execution, or is it on fire?"""
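The three hooks suggest a template-method shape: a base class drives the order and subclasses supply the datastore-specific checks. The base class below is a guess at that shape, not Yelp's implementation; `run`, its return strings, and `RecordingRunner` are hypothetical.

```python
class TaskRunner:
    """Assumed base class: only the three hook names come from the slide."""

    def __init__(self, task):
        self.task = task

    def pre_check(self) -> bool:
        raise NotImplementedError

    def execute_action(self) -> None:
        raise NotImplementedError

    def post_check(self) -> bool:
        raise NotImplementedError

    def run(self) -> str:
        # Refuse to touch a cluster that is not safe to operate on.
        if not self.pre_check():
            return "skipped"
        self.execute_action()
        # A failed post-check is reported so the task can be retried or dead-lettered.
        return "succeeded" if self.post_check() else "failed"


class RecordingRunner(TaskRunner):
    """Toy subclass that records which hooks ran, with configurable health answers."""

    def __init__(self, task, healthy_before=True, healthy_after=True):
        super().__init__(task)
        self.healthy_before = healthy_before
        self.healthy_after = healthy_after
        self.calls = []

    def pre_check(self) -> bool:
        self.calls.append("pre_check")
        return self.healthy_before

    def execute_action(self) -> None:
        self.calls.append("execute_action")

    def post_check(self) -> bool:
        self.calls.append("post_check")
        return self.healthy_after


ok = RecordingRunner({"action": "cassandra_task:restart"})
status = ok.run()
```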

Zookeeper ● Distributed Coordinator ● Non Blocking Lease ○ Time-based lease ○ Global lease ● Ephemeral locks ● Atomic Counters ○ Statistics ○ Circuit breaker
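A non-blocking, time-based lease can be modelled without Zookeeper to show the semantics: an acquire attempt returns immediately with success or failure, and a lease left behind by a dead holder frees itself when it expires. `LeaseTable` is an in-memory stand-in and an assumption; it uses no Zookeeper client API, and `now` is passed in to keep the example deterministic.

```python
from typing import Dict, Tuple


class LeaseTable:
    """Toy lease store: lease name -> (holder, expiry time)."""

    def __init__(self) -> None:
        self._leases: Dict[str, Tuple[str, float]] = {}

    def try_acquire(self, name: str, holder: str, now: float, ttl: float) -> bool:
        """Return immediately: True if `holder` now owns the lease, False otherwise."""
        current = self._leases.get(name)
        if current is not None:
            owner, expires_at = current
            # An unexpired lease held by someone else blocks the attempt.
            if now < expires_at and owner != holder:
                return False
        # Free, expired, or already ours: take or renew the lease.
        self._leases[name] = (holder, now + ttl)
        return True

    def release(self, name: str, holder: str) -> None:
        """Only the current holder may release the lease."""
        if self._leases.get(name, ("", 0.0))[0] == holder:
            del self._leases[name]


leases = LeaseTable()
a_first = leases.try_acquire("cassandra:geo_counter", "node-a", now=0.0, ttl=60.0)
b_early = leases.try_acquire("cassandra:geo_counter", "node-b", now=10.0, ttl=60.0)
b_later = leases.try_acquire("cassandra:geo_counter", "node-b", now=60.0, ttl=60.0)
```

The expiry is what addresses staleness: if `node-a` dies while holding the lease, `node-b` gets its turn once the time limit passes.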

Zookeeper: Challenges ● Staleness ○ Nodes can go down ● Garbage collection ○ Cleanup of ZK data structures ● Composition ● Starvation ● Uptime

Deployment ● Puppet ● Terraform ● Yelpsoa-configs ● PaaSTA ● Jenkins ● AWS Lambda

Failure handling ● Multiple vectors of failure ● Idempotency ● Pessimistic approach ○ Job retry ● Separation of state ● Mutability ● Highly available components ● Circuit breakers
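A circuit breaker built on a failure counter is one way to avoid a cascade: after enough consecutive failures, further tasks are refused rather than attempted. The sketch keeps the counter in a local object; a shared counter in a coordinator is assumed for the real system but not shown. `CircuitBreaker` and `max_failures` are hypothetical names.

```python
class CircuitBreaker:
    """Refuse further calls once `max_failures` consecutive failures were seen."""

    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, action):
        if self.open:
            raise RuntimeError("circuit open: refusing to run more tasks")
        try:
            result = action()
        except Exception:
            self.failures += 1
            raise
        # A success resets the count, so only consecutive failures trip the breaker.
        self.failures = 0
        return result


def failing_task():
    raise ValueError("node did not come back")


breaker = CircuitBreaker(max_failures=2)
outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_task)
    except ValueError:
        outcomes.append("task failed")
    except RuntimeError:
        outcomes.append("breaker open")
```

The third attempt never reaches the node: the breaker is already open after two failures.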

Debugging

Failure detection ● Heartbeat ping ○ End-to-end monitoring ● Dead Letter Queue ○ Recycle bin of failed tasks. ○ Hooks into human side of monitoring ● Status check
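A heartbeat check reduces to comparing the last time each node reported in against a maximum allowed age. The function name, the timestamps, and the threshold below are illustrative assumptions, not values from the talk.

```python
from typing import Dict, List


def stale_nodes(last_heartbeat: Dict[str, float], now: float, max_age: float) -> List[str]:
    """Return the nodes whose last heartbeat is older than `max_age` seconds."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > max_age)


heartbeats = {"node-a": 100.0, "node-b": 40.0, "node-c": 95.0}
stale = stale_nodes(heartbeats, now=110.0, max_age=30.0)
```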

Monitoring ● End-to-end logging ○ Un/structured ● Metrics ○ Counters ○ Queue lengths ● Aggregation and dashboards ● Staleness checks ● Dead Letter Queue ● Multi-modal Alerting

Use cases ● Restarts ● Reboots ● Instance Replacement ● Integration tests ● Kafka config reload ● Failure injection ● Backup and restore ● Search indexing ● .. and many more.

Scheduled Backups ● Safety ● Cassandra ● Elasticsearch ● Common issues ● Constraints ○ Limit ○ Healthcheck ○ Mutual exclusion

Secure Infrastructure
$ uptime
06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017

We're Hiring! www.yelp.com/careers/

fb.com/YelpEngineers @YelpEngineering engineeringblog.yelp.com github.com/yelp

Q & A ● Slides will also be uploaded to slideshare.net/slidunder.

Q & A ❖ Q: What challenges remain with Taskerman? ➢ A: ❖ Q: … ➢ A: …


Further Reading ● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ ● https://martinfowler.com/bliki/TwoHardThings.html ● https://zookeeper.apache.org/ ● https://www.terraform.io/ ● https://github.com/Yelp/service-principles ● https://en.wikipedia.org/wiki/Law_of_Demeter
