TensorFlow: A system for large-scale machine learning Martín Abadi - PowerPoint PPT Presentation

TensorFlow: A system for large-scale machine learning. Martín Abadi et al., 2016. Presented by Harrison Brown for R244

Background
• Originally built by Google engineers as successor to proprietary system for distributed training called DistBelief
  • DistBelief paper published, code not released
• DistBelief uses parameter server architecture
  • Stateless workers, stateful parameter servers
• Machine learning algorithms
  • DAG that terminates with a loss function, backpropagation, SGD
• TensorFlow used internally at Google before being released as open source
• Dataflow architecture
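The parameter server split described above (stateless workers, stateful parameter servers) can be illustrated with a minimal sketch in plain Python. The class and function names here (`ParameterServer`, `worker_gradient`, `train`) are hypothetical and are not DistBelief or TensorFlow APIs; the model, data, and learning rate are also assumptions chosen only to keep the example small.

```python
# Toy parameter-server loop. All names are hypothetical; this is not
# DistBelief's or TensorFlow's actual interface.

class ParameterServer:
    """Stateful: owns the model parameters and applies updates to them."""

    def __init__(self, weights, learning_rate):
        self.weights = list(weights)
        self.learning_rate = learning_rate

    def pull(self):
        # Workers fetch a copy of the current parameters.
        return list(self.weights)

    def push(self, gradients):
        # Plain SGD step: w <- w - lr * g
        self.weights = [w - self.learning_rate * g
                        for w, g in zip(self.weights, gradients)]


def worker_gradient(weights, example):
    """Stateless: gradient of the squared error for one example.

    Model: prediction = w0 + w1 * x, loss = (prediction - y) ** 2.
    """
    x, y = example
    error = (weights[0] + weights[1] * x) - y
    return [2.0 * error, 2.0 * error * x]


def train(server, examples, epochs):
    for _ in range(epochs):
        for example in examples:
            weights = server.pull()                         # read state
            server.push(worker_gradient(weights, example))  # send update
    return server.weights


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # samples of y = 2x + 1
server = ParameterServer(weights=[0.0, 0.0], learning_rate=0.05)
final_weights = train(server, data, epochs=500)
```

The worker keeps nothing between calls, so in this sketch any number of workers could be added or restarted without losing model state; all of it lives on the server.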

4 Extensions
• New layers
  • DistBelief uses C++, which limits researchers' ability to experiment
• Refining training algorithms
  • SGD can be optimized in several ways (Adam, AdaGrad, etc.)
  • DistBelief requires modifications of the parameter server implementation
• New training algorithms
  • Need a system that works well for ML algorithms other than feed-forward NNs (e.g. adversarial networks, reinforcement learning, expectation-maximization)
• Ease of prototyping on local machines, GPU acceleration
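To make the "refining training algorithms" point concrete, the sketch below treats the update rule as an ordinary function that can be swapped without touching the training loop. It is a plain-Python illustration, not TensorFlow code; the function names, the objective (w - 3)^2, and the learning rates are assumptions chosen for the example.

```python
import math


def sgd_update(lr):
    """Plain SGD: w <- w - lr * g."""
    def step(w, g, state):
        return w - lr * g
    return step


def adagrad_update(lr, eps=1e-8):
    """AdaGrad: scale each step by the root of the summed squared gradients."""
    def step(w, g, state):
        state["accum"] = state.get("accum", 0.0) + g * g
        return w - lr * g / (math.sqrt(state["accum"]) + eps)
    return step


def minimize(update, w=0.0, steps=200):
    """Minimize f(w) = (w - 3)^2 with whichever update rule is passed in."""
    state = {}  # per-parameter optimizer state (used by AdaGrad only)
    for _ in range(steps):
        g = 2.0 * (w - 3.0)  # gradient of (w - 3)^2
        w = update(w, g, state)
    return w


w_sgd = minimize(sgd_update(lr=0.1))
w_adagrad = minimize(adagrad_update(lr=1.0))
```

Both rules drive `w` toward 3; only the function passed to `minimize` changes, which is the kind of experiment the slide says required parameter server changes in DistBelief.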

https://www.tensorflow.org/tensorboard/r1/graphs

Comparison
• Torch
  • Imperative model, control over execution and performance
  • Lack of dataflow graph hurts experimentation, training, and ease of deployment
• Caffe
  • Easy to create new models with existing layers, but difficult for research into new models or optimizers; not extensible
  • Focus on CNNs (at time of paper), difficult to use RNNs
• Theano
  • Computation graph, mathematical operations, control flow and loops; flexible
  • Difficult to scale
• MXNet
  • Computation graph, runs and scales very efficiently

Technical Design
• High-level scripting interface, ease of use, research oriented
• Individual mathematical operators are nodes in the dataflow graph
  • Easier to compose novel layers
• Two phases
  • Define program as symbolic graph
  • Execute optimized version on available devices
• Common abstraction for accelerators
• Operations on Tensors
• Tasks (PS tasks and worker tasks)
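The two phases listed above (define a symbolic graph, then execute it) can be sketched without TensorFlow itself. The toy below is plain Python with hypothetical names (`Node`, `placeholder`, `constant`, `add`, `mul`, `run`); it performs no optimization or device placement and only shows that building the graph computes nothing until a separate run step is given inputs.

```python
class Node:
    """One vertex of a symbolic dataflow graph (hypothetical toy class)."""

    def __init__(self, op, inputs=(), name=None, value=None):
        self.op = op              # "placeholder", "const", "add", or "mul"
        self.inputs = tuple(inputs)
        self.name = name          # used only by placeholders
        self.value = value        # used only by constants


def placeholder(name):
    return Node("placeholder", name=name)


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(fetch, feed):
    """Phase 2: evaluate `fetch`, computing each reachable node once."""
    cache = {}

    def evaluate(node):
        if node in cache:
            return cache[node]
        if node.op == "placeholder":
            result = feed[node.name]
        elif node.op == "const":
            result = node.value
        else:
            left, right = (evaluate(i) for i in node.inputs)
            result = left + right if node.op == "add" else left * right
        cache[node] = result
        return result

    return evaluate(fetch)


# Phase 1: describe y = 3x + 1 symbolically. No arithmetic happens here.
x = placeholder("x")
y = add(mul(x, constant(3.0)), constant(1.0))

# Phase 2: execute the same graph with different inputs.
first = run(y, {"x": 2.0})
second = run(y, {"x": -1.0})
```

Because the whole program exists as data before it runs, a real system has the chance to rewrite, partition, or place it on devices between the two phases.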

Execution
• Single dataflow graph
  • Supports multiple concurrent executions on overlapping subgraphs
• Vertices (Operations) with mutable state
  • Permits in-place updates
  • Takes m tensors as input, produces n tensors as output
• Tensors
  • N-dimensional arrays with a small number of primitive types
• Can support asynchronous and synchronized execution
  • Lock-free SGD is most common
• Allows operations to be manually placed
• Automatic differentiation of control flow constructs
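The idea of a vertex with mutable state that is shared by overlapping subgraphs can be sketched as follows. This is plain Python with hypothetical names (`Variable`, `train_step`, `predict_step`), not TensorFlow's API: one stateful object is updated in place by a training step and read by a separate prediction step.

```python
class Variable:
    """Stateful vertex: owns a buffer that outlives any single step."""

    def __init__(self, values):
        self.buffer = list(values)

    def read(self):
        # Returns a copy so callers cannot mutate the state by accident.
        return list(self.buffer)

    def assign_sub(self, delta):
        # In-place update: mutates the existing buffer instead of
        # allocating a new one.
        for i, d in enumerate(delta):
            self.buffer[i] -= d


def train_step(var, grads, lr):
    """One subgraph: applies a gradient update to the shared variable."""
    var.assign_sub([lr * g for g in grads])


def predict_step(var, x):
    """Another subgraph: reads the same variable to compute a dot product."""
    w = var.read()
    return sum(wi * xi for wi, xi in zip(w, x))


weights = Variable([1.0, 2.0])
buffer_before = weights.buffer
before = predict_step(weights, [1.0, 1.0])
train_step(weights, grads=[2.0, 4.0], lr=0.5)
after = predict_step(weights, [1.0, 1.0])
same_buffer = weights.buffer is buffer_before
```

The two step functions overlap only at `weights`. The sketch runs them one after the other; it does not model the concurrent or lock-free execution mentioned in the slide.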

Implementation
• C++ implementation for performance, can run on standard architectures
• Master obtains subgraphs for each device
• Executor handles requests from the master
• Tooling support (graph visualization, profiler for traces, etc.)
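How a master might obtain one subgraph per device can be sketched as a grouping step over an already placed graph. The function name `partition`, the op names, and the device strings below are illustrative assumptions, not TensorFlow internals. Edges that cross devices are only recorded here; in a real system such edges need some form of communication between the devices.

```python
from collections import defaultdict


def partition(edges, placement):
    """Split a placed graph into per-device subgraphs.

    edges: list of (src, dst) op names.
    placement: dict mapping op name -> device string.
    Returns (subgraphs, cross_device_edges).
    """
    subgraphs = defaultdict(lambda: {"ops": set(), "edges": []})
    cross_device_edges = []

    for op, device in placement.items():
        subgraphs[device]["ops"].add(op)

    for src, dst in edges:
        if placement[src] == placement[dst]:
            subgraphs[placement[src]]["edges"].append((src, dst))
        else:
            # The endpoints live on different devices.
            cross_device_edges.append((src, dst))

    return dict(subgraphs), cross_device_edges


# Illustrative graph: weights on a PS task, the rest on a worker task.
edges = [("weights", "matmul"), ("input", "matmul"), ("matmul", "loss")]
placement = {
    "weights": "/job:ps/task:0",
    "input": "/job:worker/task:0",
    "matmul": "/job:worker/task:0",
    "loss": "/job:worker/task:0",
}
subgraphs, cross = partition(edges, placement)
```

Each entry of `subgraphs` is what an executor on that device would be asked to run, and `cross` lists the edges where data has to move between tasks.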

Evaluation examples
• Designed to be fast, not the fastest
• MXNet comparison on image classification
• Demonstrate the scalability

Impact
• One of the most popular systems for machine learning
• Adopted very quickly
• Used widely in industry and in research
• Built for machine learning, but general enough for other computations
• The original TensorFlow is high-quality software, built to be extensible
• Over 60,000 commits and ~2.4 million lines of code today
• TensorFlow (arguably) killed Theano as it is nearly a complete replacement

Issues
• Static dataflow graphs place limitations on some algorithms, such as deep reinforcement learning
  • The Ray project attempts to address some of these issues
• Fault tolerance doesn't account for the strong consistency potentially needed by some algorithms
  • Note that the overhead required has a drastic effect on performance
• The paper states that MXNet performance is nearly identical; however, that may not be the case

Questions?

Sources
• [1] M. Abadi et al. TensorFlow: A system for large-scale machine learning. OSDI, 2016.
• [2] M. Abadi, M. Isard, and D. Murray. A computational model for TensorFlow: An introduction. MAPL, 2017.
• [3] The Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
• [4] TensorFlow, 2019. www.tensorflow.org
