
Multi-tenant Machine Learning Apache Aurora & Apache Mesos

Stephan Erb, serb@apache.org, @ErbStephan, 2016.11.15


  1. Multi-tenant Machine Learning Apache Aurora & Apache Mesos Stephan Erb serb@apache.org 2016.11.15 @ErbStephan

  2. Apache Aurora (https://aurora.apache.org): Mesos framework for the deployment and scaling of stateless and fault-tolerant services in a datacenter.
     Apache Mesos (https://mesos.apache.org): cluster manager providing fault-tolerant, fine-grained multitenancy via containers.

  3. Apache Aurora (https://aurora.apache.org): "distributed supervisord"
     Apache Mesos (https://mesos.apache.org): "plumbing"

  4. Cluster Manager

  6. Aurora Example

     webservice = Process(
         name = 'webservice',
         cmdline = './run_my_webservice.py')

     task = Task(
         processes = [webservice],
         resources = Resources(cpu=4, ram=4*GB, disk=8*GB))

     jobs = [
         Job(
             task = task,
             instances = 4,
             constraints = {'host': 'limit:1'},
             service = True,
             cluster = 'rz1',
             role = 'www',
             environment = 'prod',
             name = 'webserver'),
     ]

  7. Aurora Example

     $ aurora update start rz1/www/prod/webserver \
         webserver.aurora

  8. ● Coordinator Node: Aurora Scheduler, ZooKeeper, Mesos Master, State
     ● Worker Node: Mesos Agent
     ● Task (Container): Aurora Executor, User Code

  9. Photo by liz west https://flic.kr/p/7qYh21

  10. [Diagram labels: Customer System; Data Delivery; Predictions; Decisions; Tenant / ML model (shown twice); Historic Tenant Data; Compute Platform]

  11. Key Achievement Data scientists deploy to production.

  14. bigger VM/Host VM/Host

  15. Data larger than RAM

      Implementation Choices:
      • semi-external implementation (out-of-core)
      • communication-efficient distributed memory implementation
      • streaming (aka “large data volumes are hard, infinite data is easy”)
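As a rough illustration of the streaming option, the sketch below computes a mean and variance in a single pass (Welford's algorithm), so only one value needs to be held in memory at a time. The function name `running_mean_var` and the sample data are assumptions made for illustration; they do not come from the talk.

```python
from typing import Iterable, Tuple


def running_mean_var(values: Iterable[float]) -> Tuple[int, float, float]:
    """Single-pass mean and population variance (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator stands in for a data source that does not fit into RAM.
stream = (float(i) for i in range(1, 6))
count, mean, variance = running_mean_var(stream)
```

Because the state is just three numbers, the same loop works whether the input is a large file read line by line or an unbounded stream.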

  17. Domain-specific Problem Decomposition

      # Compute on whole data set
      compute_prediction(data)

      # Compute on partitioned data
      #
      # (this is rather restrictive but tends to
      # work great for many use cases)
      for chunk in partition(data):
          compute_prediction(chunk)
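The decomposition on this slide can be sketched as runnable Python. The slide only names `partition` and `compute_prediction`; their bodies here, the `chunk_size` parameter, and the sample data are assumptions chosen to keep the example small, with the mean of a chunk standing in for a real model.

```python
from typing import Iterator, Sequence


def partition(data: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping chunks of at most chunk_size items."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def compute_prediction(chunk: Sequence[float]) -> float:
    """Placeholder model (assumption): predict the mean of the chunk."""
    return sum(chunk) / len(chunk)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Each chunk is processed independently, so chunks could run on different workers.
predictions = [compute_prediction(chunk) for chunk in partition(data, chunk_size=2)]
```

This only works when a prediction for one chunk does not depend on data in another chunk, which is the restriction the slide mentions.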

  18. Python Scheduling

      Master
      • manages job graphs
      • guarantees fault tolerance

      Workers
      • run Python functions
      • distributable
      • dynamic worker count

      http://www.celeryproject.org/
      http://distributed.readthedocs.io/en/latest/
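The master/worker split can be illustrated on a single machine with the standard library. This is a stand-in, not the Celery or distributed API: a `ThreadPoolExecutor` plays the master that hands out work, and its threads play the workers. The `compute_prediction` body and the chunk data are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_prediction(chunk):
    """Placeholder model (assumption): predict the mean of the chunk."""
    return sum(chunk) / len(chunk)


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# The executor hands each chunk to a free worker and returns a future per task.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(compute_prediction, chunk) for chunk in chunks]
    # Collect results in submission order, regardless of completion order.
    results = [future.result() for future in futures]
```

A cluster-level scheduler adds what this sketch lacks: workers on other hosts, retries after worker loss, and a worker count that can change while jobs run.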

  19. Cluster Scheduling: Project/Tenant, Compute Cluster

  22. Key Idea: Multi-tenancy via multi-instance deployments

  23. “Good multi-tenancy is hard enough that it just doesn’t happen by accident.” (Jay Kreps)
      https://www.confluent.io/blog/sharing-is-caring-multi-tenancy-in-distributed-data-systems

  24. Multi-tenant Features

      Aurora
      • Structured job keys
        • role (tenant01, …)
        • environments (devel, …)
        • name
      • Job tiers/priorities
      • Quota & preemption

      Mesos
      • Linux users
      • Filesystem isolation via Docker/Appc containers
      • CPU/RAM isolation via cgroups
      • Linux namespaces (pid, network, …)
      • Multi-framework support
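To make the structured job keys concrete, here is a minimal sketch that models the `cluster/role/environment/name` form seen in the earlier `rz1/www/prod/webserver` example and derives one key per tenant. The `JobKey` class, `parse_job_key`, the second tenant, and the job name `scheduler` are hypothetical and not part of Aurora's API.

```python
from typing import NamedTuple


class JobKey(NamedTuple):
    """Hypothetical model of an Aurora job key: cluster/role/environment/name."""
    cluster: str
    role: str
    environment: str
    name: str

    def __str__(self) -> str:
        return "/".join(self)


def parse_job_key(text: str) -> JobKey:
    """Split a job key string into its four non-empty components."""
    parts = text.split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected cluster/role/environment/name, got {text!r}")
    return JobKey(*parts)


# One role per tenant: each tenant gets its own instance of the same job.
tenants = ["tenant01", "tenant02"]  # tenant02 is an assumed example value
keys = [JobKey("rz1", tenant, "prod", "scheduler") for tenant in tenants]
```

Using the tenant as the role keeps each tenant's jobs in a separate namespace of the key, which is where per-role settings such as quota can attach.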

  25. Merits and Pitfalls? Multiple frameworks on the same Mesos cluster

  26. Feature Dimensions

      User
      • long-running services
      • cron jobs & adhoc jobs
      • rolling job updates, with automatic rollback
      • service announcement in ZooKeeper
      • scheduling constraints
      • Docker/Appc support

      Operator
      • high-availability
      • maintenance primitives
      • resource quotas and preemption
      • instrumented for monitoring and debugging
      • oversubscription
      • self-service UIs

  27. Oversubscription https://github.com/blue-yonder/mesos-threshold-oversubscription

  28. Executive Summary

      In this talk, we have seen:
      • Aurora & Mesos provide excellent support for heterogeneous workloads.
      • They can even be used by data scientists to ship machine learning models into production.
      • All without major headaches for your operations team.

  29. Thank you! Stephan Erb serb@apache.org 2016.11.15 @ErbStephan
