Taskerman: A Distributed Cluster Task Manager (Raghavendra D Prabhu)

  1. Taskerman: A Distributed Cluster Task Manager ● Raghavendra D Prabhu ● rprabhu@yelp.com ● @randomsurfer ● Distributed Systems

  2. Yelp’s Mission Connecting people with great local businesses.

  3. Datastore Ecosystem @

  4. ● Cassandra ● Elasticsearch ● Zookeeper ● PostgreSQL

  5. And many more… ● Memcached ● Redis ● Spark ● Redshift ● DynamoDB ● PaaStorm ● S3

  6. Distributed Systems ● Several TB in Cassandra clusters with tens of nodes each ● Close to a million messages/second in streaming pipeline ● Several TB in Elasticsearch with several hundred nodes in each ● Many PB archived to S3 every month ● Multi-AZ Multi-Region ● And growing…

  7. “Need to run logical backup on a fleet without disruption to ingress traffic” “Run anti-entropy repair on a Cassandra cluster without spiking read latency” “Reboot 1000 instances without taking a millennium, but without bringing down the site either” “Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”

  8. Pet vs Cattle

  9. ● Maintenance Cost ● Engineering Efficiency ● Scalability

  10. Taskerman

  11. Requirements ● Safe ● Security ● Generic and Extensible ● Distributed ● Loosely coupled ● Cluster awareness

  12. Desirable ● Schedulable ● Reusable ● Auditability ○ Not Ad-hoc ○ More Declarative, Less Imperative ○ Config Management ● Maintainability ● Observability ● Resilience

  13. Safety ● Paramount* ● Serialized execution ○ ‘m’ out of ‘n’ ○ Disjoint jobs. ● Avoid cascade ● Privilege escalation ● Pull-based * Unless oncall is automated too.

  14. Fallacies of Distributed Systems ● Network is reliable ● Latency is zero ● Bandwidth is infinite ● Network is secure ● There is one administrator ● Transport cost is zero ● Network is homogeneous ● Topology doesn't change

  15. Quotes ● “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” @mathiasverraes ● “There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek

  16. Building Blocks ● Scheduler ● Router ● Co-ordinator ● Transport ● Executor ● Error handler ● Configuration ● Monitoring ● Tooling

  17. Architecture diagram, captioned “Flow of task”. Labels shown: Task Scheduler, Workqueue, Queue Router, EC2 API, Zookeeper, Lease, Retries, Node Queues (Q1, Q2, Q3), Failure, Dead Letter Queue, Cluster (T1, T2, T3).

  18. #Anatomy of a Taskerman Task
      # Restart action for 2 nodes of geo_counter
      # cassandra cluster owned by gsi
      {
        'action': 'cassandra_task:restart',
        'version': 1.2,
        'limit': 2,
        'cluster_name': 'cassandra:geo_counter',
        'discovery': 'aws_tags',
        'owner': 'gsi',
        'task_id': 'abcd-ef123',

  19. #Anatomy of a Taskerman Task
        'taskerman_params': {
          'action_args': {'force': true},
          'workqueue_args': {'retry_count': 3},
        },
        'nodes': [],
        'destnode': '',
      }
      # force=true for restart, retry_count for queue
      # [a,b,c,d] to skip discovery
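
The task shown on slides 18 and 19 can be written out as one Python dictionary. The sketch below is illustrative only: `REQUIRED_KEYS` and `validate_task` are hypothetical names, and the checks are assumptions about what a well-formed task might need, not Taskerman's actual validation.

```python
import json

# Hypothetical: the keys from slide 18, assumed here to be mandatory.
REQUIRED_KEYS = {"action", "version", "limit", "cluster_name",
                 "discovery", "owner", "task_id"}


def validate_task(task):
    """Return a list of problems; an empty list means the task looks well-formed."""
    problems = ["missing key: " + key for key in sorted(REQUIRED_KEYS - task.keys())]
    # Assumption: an action is written as '<module>:<verb>', as in the slide.
    if str(task.get("action", "")).count(":") != 1:
        problems.append("action must look like '<module>:<verb>'")
    limit = task.get("limit")
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        problems.append("limit must be a positive integer")
    return problems


# The task from the slides: restart 2 nodes of the geo_counter Cassandra cluster.
task = {
    "action": "cassandra_task:restart",
    "version": 1.2,
    "limit": 2,
    "cluster_name": "cassandra:geo_counter",
    "discovery": "aws_tags",
    "owner": "gsi",
    "task_id": "abcd-ef123",
    "taskerman_params": {
        "action_args": {"force": True},
        "workqueue_args": {"retry_count": 3},
    },
    "nodes": [],      # a non-empty list such as ["a", "b"] would skip discovery
    "destnode": "",
}

# Serialized form, e.g. for the body of a queue message.
message_body = json.dumps(task)
```

Serializing to JSON is an assumption made for this example; the slides do not say how tasks are encoded on the queue.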

  21. Task Scheduler ● Runs on Chronos ● Emits a task ● Enqueues into global queue ● Ad-hoc invocation ● Deployment granularities ● Task tracking ● Yelpsoa-configs PaaSTA

  23. WorkQueue ● AWS SQS ● Best-effort FIFO ● Reliable and cheap ● Low latency ● Properties ○ Read without delete ○ Visibility timeout ○ Retry ○ Dead Letter Queue
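
The queue properties listed on the WorkQueue slide (read without delete, visibility timeout, retry, dead letter queue) interact in a way that is easier to see in code. This is a minimal in-memory model, not AWS SQS and not Taskerman code; `SketchQueue` and its parameters are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
import itertools


class SketchQueue:
    """In-memory model of read-without-delete, visibility timeout and a DLQ."""

    def __init__(self, visibility_timeout, max_receives):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.dead_letters = []
        self._messages = {}  # handle -> [body, receive_count, invisible_until]
        self._handles = itertools.count(1)

    def send(self, body):
        self._messages[next(self._handles)] = [body, 0, 0.0]

    def receive(self, now):
        """Hand out the oldest visible message without deleting it."""
        for handle, message in list(self._messages.items()):
            body, receive_count, invisible_until = message
            if invisible_until > now:
                continue  # still leased to an earlier reader
            if receive_count >= self.max_receives:
                # Retries exhausted: park the message in the dead letter queue.
                self.dead_letters.append(body)
                del self._messages[handle]
                continue
            message[1] = receive_count + 1
            message[2] = now + self.visibility_timeout
            return handle, body
        return None

    def delete(self, handle):
        """Called by the consumer only after the task has succeeded."""
        self._messages.pop(handle, None)
```

A consumer that crashes never calls `delete`, so the message reappears once its visibility timeout expires. After `max_receives` failed attempts it lands in `dead_letters`, where a human or a monitor can pick it up.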

  25. Task Router ● Stateless Marathon worker ● Routes tasks to clusters ● Custom routing logic ● At-least once delivery ● ‘DNS’ of Taskerman ● Pluggable discovery ○ AWS PaaSTA ○ Smartstack
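
One way to picture the router's job is as a lookup from a task to per-node queues, with discovery as a plug-in. The sketch below is an assumption-heavy illustration: the `TaskRouter` class, the plug-in signature (a callable taking a task and returning node names), and applying `limit` at routing time are all hypothetical choices, not details confirmed by the slides.

```python
from collections import defaultdict


class TaskRouter:
    """Stateless-style router: resolves target nodes, then fans the task out."""

    def __init__(self, discovery_plugins):
        # Maps a discovery name (e.g. 'aws_tags') to a callable: task -> nodes.
        self.discovery_plugins = discovery_plugins
        self.node_queues = defaultdict(list)  # stand-in for per-node queues

    def route(self, task):
        # An explicit node list skips discovery, as the task anatomy slide notes.
        nodes = task.get("nodes") or self.discovery_plugins[task["discovery"]](task)
        targets = list(nodes)[: task["limit"]]
        for node in targets:
            # Each node receives its own copy with 'destnode' filled in.
            self.node_queues[node].append(dict(task, destnode=node))
        return targets


# Hypothetical inventory standing in for a lookup by AWS tags.
inventory = {"cassandra:geo_counter": ["a", "b", "c", "d"]}
router = TaskRouter({"aws_tags": lambda task: inventory[task["cluster_name"]]})

routed = router.route({
    "action": "cassandra_task:restart",
    "cluster_name": "cassandra:geo_counter",
    "discovery": "aws_tags",
    "limit": 2,
    "nodes": [],
})
```

Because delivery is at-least-once, a task copy may reach a node more than once, so the action behind it needs to be safe to repeat.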

  27. TaskRunner ● The executor of Taskerman ● Dequeues a task and executes it ○ Pre-defined, reviewed code. ● Cron-ed on node ● Zookeeper for coordination ● Task deleted upon success ● Dead letter queue upon failed retries

  28. class TestTaskRunner(TaskRunner):
          def __init__(self, task,..):
              # State mgmt and datastore specific
          def pre_check(self):
              # Is the task safe to execute on this cluster
          def execute_action(self):
              # Actual execution of task:action
          def post_check(self):
              # cluster good after execution or is it on fire
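
The three hooks on the TaskRunner slide suggest a template-method shape. Here is a runnable sketch of that shape; the base class body, the `run` method, its return values, and `EchoTaskRunner` are hypothetical, since the slides show only the hook names.

```python
class TaskRunner:
    """Hypothetical base class: subclasses fill in the three hooks."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        """Is the task safe to execute on this cluster right now?"""
        raise NotImplementedError

    def execute_action(self):
        """Perform the task's action."""
        raise NotImplementedError

    def post_check(self):
        """Is the cluster healthy after execution?"""
        raise NotImplementedError

    def run(self):
        # Refuse to act when the pre-check fails; nothing has been changed yet.
        if not self.pre_check():
            return "skipped"
        self.execute_action()
        return "succeeded" if self.post_check() else "failed"


class EchoTaskRunner(TaskRunner):
    """Toy runner that records what it would have done."""

    def __init__(self, task, healthy=True):
        super().__init__(task)
        self.healthy = healthy
        self.log = []

    def pre_check(self):
        return self.healthy

    def execute_action(self):
        self.log.append("ran " + self.task["action"])

    def post_check(self):
        return True


runner = EchoTaskRunner({"action": "cassandra_task:restart"})
result = runner.run()
```

In this sketch a `"succeeded"` result would correspond to deleting the task from the queue, while `"failed"` or `"skipped"` would leave it to be retried.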

  30. Zookeeper ● Distributed Coordinator ● Non Blocking Lease ○ Time-based lease ○ Global lease ● Ephemeral locks ● Atomic Counters ○ Statistics ○ Circuit breaker
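
A non-blocking, time-based lease combined with the “‘m’ out of ‘n’” rule from the Safety slide can be sketched as follows. This is an in-memory stand-in for illustration only: `LeaseTable` is a hypothetical name, and a real deployment would keep this state in Zookeeper rather than in a Python dict.

```python
class LeaseTable:
    """At most `max_holders` nodes may hold a lease at once; leases expire."""

    def __init__(self, max_holders, ttl):
        self.max_holders = max_holders
        self.ttl = ttl
        self._expiry = {}  # node -> time at which its lease lapses

    def try_acquire(self, node, now):
        """Non-blocking: returns immediately with True or False."""
        # Drop lapsed leases so a crashed holder cannot block others forever.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        if node in self._expiry or len(self._expiry) < self.max_holders:
            self._expiry[node] = now + self.ttl
            return True
        return False

    def release(self, node):
        self._expiry.pop(node, None)


# Allow 2 out of n nodes to run a task at the same time, for up to 60 time units.
leases = LeaseTable(max_holders=2, ttl=60)
```

A node that fails to get a lease simply leaves its task on the queue and tries again on its next cron run, which is what makes the lease non-blocking. The expiry time addresses the staleness problem of holders that go down without releasing.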

  31. Zookeeper: Challenges ● Staleness ○ Nodes can go down ● Garbage collection ○ Cleanup of ZK data structures ● Composition ● Starvation ● Uptime

  32. Deployment ● Puppet ● Terraform ● Yelpsoa-configs ● PaaSTA ● Jenkins ● AWS Lambda

  33. Failure handling ● Multiple vectors of failure ● Idempotency ● Pessimistic approach ○ Job retry ● Separation of state ● Mutability ● Highly available components ● Circuit breakers
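
The circuit-breaker idea, which also appears under Zookeeper's atomic counters, can be illustrated with a failure counter that halts a rollout before failures cascade. Everything named here (`CircuitBreaker`, `run_fleet`, the consecutive-failure threshold) is a hypothetical sketch rather than Taskerman's implementation.

```python
class CircuitBreaker:
    """Opens after a run of consecutive failures and then stays open."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success):
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def run_fleet(nodes, action, breaker):
    """Apply `action` node by node, stopping as soon as the breaker opens."""
    done = []
    for node in nodes:
        if breaker.open:
            break  # pessimistic: stop rather than risk a cascade
        succeeded = action(node)
        breaker.record(succeeded)
        if succeeded:
            done.append(node)
    return done


breaker = CircuitBreaker(failure_threshold=2)
# Toy action: restarts fail on nodes 'b' and 'c'.
completed = run_fleet(["a", "b", "c", "d", "e"], lambda node: node not in {"b", "c"}, breaker)
```

Nodes `d` and `e` are never touched once two consecutive failures have been seen. Leaving the breaker open until someone resets it is a deliberate choice in this sketch, in keeping with the pessimistic approach on the slide.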

  34. Debugging

  35. Failure detection ● Heartbeat ping ○ End-to-end monitoring ● Dead Letter Queue ○ Recycle bin of failed tasks. ○ Hooks into human side of monitoring ● Status check

  36. Monitoring ● End-to-end logging ○ Un/structured ● Metrics ○ Counters ○ Queue lengths ● Aggregation and dashboards ● Staleness checks ● Dead Letter Queue ● Multi-modal Alerting

  37. Use cases ● Restarts ● Reboots ● Instance Replacement ● Integration tests ● Kafka config reload ● Failure injection ● Backup and restore ● Search indexing ● .. and many more.

  38. Scheduled Backups ● Safety ● Cassandra ● Elasticsearch ● Common issues ● Constraints ○ Limit ○ Healthcheck ○ Mutual exclusion

  39. Secure Infrastructure
      $ uptime
      06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
      $ ps -eo pid,cmd,lstart | grep ..
      10058 zookeeper Tue Dec 5 05:23:43 2017

  40. We're Hiring! www.yelp.com/careers/

  41. fb.com/YelpEngineers @YelpEngineering engineeringblog.yelp.com github.com/yelp

  42. Q & A ● Slides will also be uploaded to slideshare.net/slidunder.

  43. Q & A ❖ Q: What challenges remain with Taskerman? ➢ A: ❖ Q: … ➢ A: …

  46. Further Reading ● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ ● https://martinfowler.com/bliki/TwoHardThings.html ● https://zookeeper.apache.org/ ● https://www.terraform.io/ ● https://github.com/Yelp/service-principles ● https://en.wikipedia.org/wiki/Law_of_Demeter
