Taskerman A Distributed Cluster Task Manager Raghavendra D Prabhu rprabhu@yelp.com @randomsurfer Distributed Systems
Yelp’s Mission Connecting people with great local businesses.
Datastore Ecosystem @ Yelp
Cassandra Elasticsearch Zookeeper PostgreSQL
And many more... ● Memcached ● Redis ● Spark ● Redshift ● DynamoDB ● PaaStorm ● S3
Distributed Systems ● Several TB in Cassandra clusters with tens of nodes each ● Close to a million messages/second in streaming pipeline ● Several TB in Elasticsearch with several hundred nodes in each ● Many PB archived to S3 every month ● Multi-AZ Multi-Region ● And growing…
“Need to run a logical backup on a fleet without disrupting ingress traffic” “Run anti-entropy repair on a Cassandra cluster without spiking read latency” “Reboot 1000 instances without taking a millennium, but without bringing down the site either” “Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”
Pet vs Cattle
Maintenance Cost Engineering Efficiency Scalability
Taskerman
Requirements ● Safe ● Security ● Generic and Extensible ● Distributed ● Loosely coupled ● Cluster awareness
Desirable ● Schedulable ● Reusable ● Auditability ○ Not Ad-hoc ○ More Declarative, Less Imperative ○ Config Management ● Maintainability ● Observability ● Resilience
Safety ● Paramount* ● Serialized execution ○ ‘m’ out of ‘n’ ○ Disjoint jobs. ● Avoid cascade ● Privilege escalation ● Pull-based * Unless oncall is automated too.
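A minimal sketch of the ‘m’ out of ‘n’ idea, assuming it means that at most m nodes of an n-node cluster may be worked on at the same time. The function name `pick_batch` and its arguments are hypothetical and are not taken from Taskerman itself.

```python
def pick_batch(nodes, in_progress, limit):
    """Return the nodes that may start now so that at most `limit` run at once.

    nodes:       all n nodes of the cluster, in a stable order
    in_progress: set of nodes that are currently executing a task
    limit:       the 'm' in 'm out of n'
    """
    free_slots = max(0, limit - len(in_progress))
    candidates = [node for node in nodes if node not in in_progress]
    return candidates[:free_slots]


if __name__ == "__main__":
    # One of two slots is already taken by node "a", so only one more may start.
    print(pick_batch(["a", "b", "c", "d"], {"a"}, limit=2))
```

With a limit of 1 this degenerates to fully serialized execution, one node at a time.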
Fallacies of Distributed Systems ● Network is reliable ● Latency is zero ● Bandwidth is infinite ● Network is secure ● One administrator ● Transport cost is zero ● Network is homogeneous ● Topology doesn't change
Quotes
“There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” (@mathiasverraes)
“There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” (@secretGeek)
Building Blocks ● Scheduler ● Router ● Co-ordinator ● Transport ● Executor ● Error handler ● Configuration ● Monitoring ● Tooling
[Architecture diagram, "Flow of task": Task Scheduler, Workqueue, Queue Router, EC2 API, Node Queues (Q1, Q2, Q3), Cluster (tasks T1, T2, T3), Zookeeper (lease), Retries, Failure, Dead Letter Queue]
# Anatomy of a Taskerman Task
# Restart action for 2 nodes of the geo_counter
# Cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': 'abcd-ef123',
    'taskerman_params': {
        'action_args': {'force': True},        # force=True for restart
        'workqueue_args': {'retry_count': 3},  # retry_count for queue
    },
    'nodes': [],      # [a,b,c,d] to skip discovery
    'destnode': '',
}
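A hedged sketch of how a consumer might validate a task payload like the one above. The `Task` dataclass, the `parse_task` helper, the choice of required keys, and the reading of 'action' as a "module:action" pair are all assumptions made for illustration; the slides do not show Taskerman's real parsing code.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    action: str
    cluster_name: str
    limit: int = 1
    nodes: list = field(default_factory=list)

    @property
    def handler(self):
        # Assumed convention: 'cassandra_task:restart' -> ('cassandra_task', 'restart')
        module, _, name = self.action.partition(":")
        return module, name

    @property
    def needs_discovery(self):
        # An empty 'nodes' list means the nodes must be discovered.
        return not self.nodes


def parse_task(raw):
    """Validate a raw task dict and turn it into a Task (hypothetical helper)."""
    missing = [key for key in ("action", "cluster_name") if key not in raw]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    if ":" not in raw["action"]:
        raise ValueError("action must look like 'module:action'")
    return Task(
        action=raw["action"],
        cluster_name=raw["cluster_name"],
        limit=int(raw.get("limit", 1)),
        nodes=list(raw.get("nodes", [])),
    )


if __name__ == "__main__":
    task = parse_task({
        "action": "cassandra_task:restart",
        "cluster_name": "cassandra:geo_counter",
        "limit": 2,
        "nodes": [],
    })
    print(task.handler, task.needs_discovery)
```

Rejecting malformed tasks at the edge keeps bad payloads from ever reaching a node.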
Task Scheduler ● Runs on Chronos ● Emits a task ● Enqueues into global queue ● Ad-hoc invocation ● Deployment granularities ● Task tracking ● Yelpsoa-configs PaaSTA
WorkQueue ● AWS SQS ● Best-effort FIFO ● Reliable and cheap ● Low latency ● Properties ○ Read without delete ○ Visibility timeout ○ Retry ○ Dead Letter Queue
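The queue properties listed above (read without delete, visibility timeout, retry, dead letter queue) can be illustrated with a small in-memory model. This is a sketch of the semantics only, not the SQS API: the class `InMemoryQueue` and its parameters are hypothetical, and a real deployment would rely on the queue service's own visibility timeout and redrive settings.

```python
import time


class InMemoryQueue:
    """Toy queue with a visibility timeout and a dead letter list."""

    def __init__(self, visibility_timeout, max_receives, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.clock = clock
        self.messages = []       # each entry: {"body", "visible_at", "receives"}
        self.dead_letters = []   # bodies that exhausted their retries

    def send(self, body):
        self.messages.append(
            {"body": body, "visible_at": float("-inf"), "receives": 0}
        )

    def receive(self):
        """Return the first visible message without deleting it, or None."""
        now = self.clock()
        for msg in list(self.messages):
            if msg["visible_at"] > now:
                continue  # still hidden: another consumer is working on it
            if msg["receives"] >= self.max_receives:
                # Retries exhausted: park the message for a human to inspect.
                self.messages.remove(msg)
                self.dead_letters.append(msg["body"])
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """Acknowledge success; the message will not be redelivered."""
        self.messages.remove(msg)
```

A consumer that crashes simply never calls `delete`, so the message becomes visible again after the timeout and is retried; that is the source of at-least-once behaviour in this model.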
Task Router ● Stateless Marathon worker ● Routes tasks to clusters ● Custom routing logic ● At-least-once delivery ● ‘DNS’ of Taskerman ● Pluggable discovery ○ AWS PaaSTA ○ Smartstack
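One way to picture "pluggable discovery" is a registry of lookup functions keyed by the task's 'discovery' field. The sketch below is an assumption about the shape of such a router, not Taskerman's code: the `discovery` decorator, the `static` backend, the `CLUSTERS` table, and the rule that the router applies 'limit' when choosing nodes are all hypothetical.

```python
DISCOVERY_BACKENDS = {}


def discovery(name):
    """Register a function as the discovery backend called `name`."""
    def register(fn):
        DISCOVERY_BACKENDS[name] = fn
        return fn
    return register


# Hypothetical static inventory standing in for AWS tags or Smartstack.
CLUSTERS = {"cassandra:geo_counter": ["node-a", "node-b", "node-c"]}


@discovery("static")
def static_discovery(task):
    return list(CLUSTERS[task["cluster_name"]])


def route(task, node_queues):
    """Append the task to the queue of each selected node; return those nodes."""
    # An explicit 'nodes' list skips discovery entirely.
    nodes = task.get("nodes") or DISCOVERY_BACKENDS[task["discovery"]](task)
    targets = nodes[: task.get("limit", len(nodes))]
    for node in targets:
        node_queues.setdefault(node, []).append(task)
    return targets


if __name__ == "__main__":
    queues = {}
    task = {"cluster_name": "cassandra:geo_counter", "discovery": "static",
            "limit": 2, "nodes": []}
    print(route(task, queues))
```

Because the router keeps no state of its own, any number of identical workers can run it side by side.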
TaskRunner ● The executor of Taskerman ● Dequeues a task and executes it ○ Pre-defined, reviewed code. ● Cron-ed on node ● Zookeeper for coordination ● Task deleted upon success ● Dead letter queue upon failed retries
class TestTaskRunner(TaskRunner):
    def __init__(self, task,..):
        # State mgmt and datastore specific
    def pre_check(self):
        # Is the task safe to execute on this cluster
    def execute_action(self):
        # Actual execution of task:action
    def post_check(self):
        # Is the cluster good after execution, or is it on fire
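The three hooks above suggest a template-method style base class. The sketch below is a guess at how such a base class could drive them; the `run` method, its return values, and the `EchoTaskRunner` subclass are hypothetical and only illustrate the pre_check / execute_action / post_check ordering.

```python
class TaskRunner:
    """Assumed base class: runs the three hooks in a fixed order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        return True

    def run(self):
        if not self.pre_check():
            # Not safe right now: leave the task in the queue for a retry.
            return "skipped"
        self.execute_action()
        if not self.post_check():
            raise RuntimeError("cluster unhealthy after task execution")
        return "done"


class EchoTaskRunner(TaskRunner):
    """Toy runner that records what it did instead of touching a datastore."""

    def __init__(self, task, healthy=True):
        super().__init__(task)
        self.healthy = healthy
        self.log = []

    def pre_check(self):
        self.log.append("pre_check")
        return self.healthy

    def execute_action(self):
        self.log.append("execute:" + self.task["action"])

    def post_check(self):
        self.log.append("post_check")
        return True


if __name__ == "__main__":
    runner = EchoTaskRunner({"action": "cassandra_task:restart"})
    print(runner.run(), runner.log)
```

Keeping the ordering in the base class means each datastore-specific runner only has to answer three questions: is it safe, what to do, and did it work.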
Zookeeper ● Distributed Coordinator ● Non Blocking Lease ○ Time-based lease ○ Global lease ● Ephemeral locks ● Atomic Counters ○ Statistics ○ Circuit breaker
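A non-blocking, time-based lease can be sketched in a few lines. The version below is a single-process stand-in with an injectable clock, written only to show the semantics; the `Lease` class is hypothetical, and in a real system the owner and expiry would live in Zookeeper rather than in local memory.

```python
import time


class Lease:
    """Toy time-based lease: try_acquire never blocks, it just says yes or no."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, owner):
        now = self.clock()
        expired = self.owner is not None and now >= self.expires_at
        if self.owner is None or expired or self.owner == owner:
            # Free, expired, or a renewal by the current holder.
            self.owner = owner
            self.expires_at = now + self.ttl
            return True
        return False

    def release(self, owner):
        # Only the current holder may release the lease.
        if self.owner == owner:
            self.owner = None


if __name__ == "__main__":
    lease = Lease(ttl=60)
    print(lease.try_acquire("node-a"), lease.try_acquire("node-b"))
```

The expiry is what guards against the staleness problem: a holder that dies without releasing cannot block the cluster for longer than the TTL.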
Zookeeper: Challenges ● Staleness ○ Nodes can go down ● Garbage collection ○ Cleanup of ZK data structures ● Composition ● Starvation ● Uptime
Deployment ● Puppet ● Terraform ● Yelpsoa-configs ● PaaSTA ● Jenkins ● AWS Lambda
Failure handling ● Multiple vectors of failure ● Idempotency ● Pessimistic approach ○ Job retry ● Separation of state ● Mutability ● Highly available components ● Circuit breakers
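A circuit breaker in its simplest form counts consecutive failures and refuses further work once a threshold is reached. The sketch below assumes that behaviour; the `CircuitBreaker` and `CircuitOpenError` names are hypothetical, and the counter is local here, whereas the slides suggest Zookeeper atomic counters for the shared version.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to run the action."""


class CircuitBreaker:
    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise CircuitOpenError("too many consecutive failures")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success closes the loop again
        return result

    def reset(self):
        """Manual reset, e.g. after an operator has looked at the failures."""
        self.failures = 0
```

Once open, the breaker stops a failing task from marching through the rest of the fleet, which is one way to avoid the cascade mentioned under Safety.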
Debugging
Failure detection ● Heartbeat ping ○ End-to-end monitoring ● Dead Letter Queue ○ Recycle bin of failed tasks. ○ Hooks into human side of monitoring ● Status check
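A heartbeat check usually reduces to comparing the last time each node reported in against a maximum age. The helper below is a minimal sketch under that assumption; `stale_nodes` and its arguments are hypothetical names.

```python
def stale_nodes(last_heartbeat, now, max_age):
    """Return the nodes whose last heartbeat is older than `max_age` seconds.

    last_heartbeat: dict mapping node name -> timestamp of its last heartbeat
    now:            current timestamp, in the same units
    """
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > max_age
    )


if __name__ == "__main__":
    print(stale_nodes({"node-a": 100, "node-b": 40}, now=110, max_age=60))
```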
Monitoring ● End-to-end logging ○ Un/structured ● Metrics ○ Counters ○ Queue lengths ● Aggregation and dashboards ● Staleness checks ● Dead Letter Queue ● Multi-modal Alerting
Use cases ● Restarts ● Reboots ● Instance Replacement ● Integration tests ● Kafka config reload ● Failure injection ● Backup and restore ● Search indexing ● .. and many more.
Scheduled Backups ● Safety ● Cassandra ● Elasticsearch ● Common issues ● Constraints ○ Limit ○ Healthcheck ○ Mutual exclusion
Secure Infrastructure
$ uptime
06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017
We're Hiring! www.yelp.com/careers/
fb.com/YelpEngineers @YelpEngineering engineeringblog.yelp.com github.com/yelp
Q & A ● Slides will also be uploaded to slideshare.net/slidunder.
Q & A ❖ Q: What challenges remain with Taskerman? ➢ A: ❖ Q: … ➢ A: …
Further Reading ● https://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html ● http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ ● https://martinfowler.com/bliki/TwoHardThings.html ● https://zookeeper.apache.org/ ● https://www.terraform.io/ ● https://github.com/Yelp/service-principles ● https://en.wikipedia.org/wiki/Law_of_Demeter