Big Data for Data Science: Scalable Machine Learning

  1. Big Data for Data Science Scalable Machine Learning event.cwi.nl/lsde

  2. A SHORT INTRODUCTION TO NEURAL NETWORKS (credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung)

  3. Example: Image Recognition
      [figure: input image ➔ weights (the AlexNet ‘convolutional’ neural network) ➔ loss]

  4. Neural Nets - Basics
      • Score function (linear, matrix)
      • Activation function (normalize to [0, 1])
      • Regularization function (penalize complex W)
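
      A minimal NumPy sketch of these three ingredients (the layer sizes and the softmax/L2 choices are illustrative assumptions, not taken from the slide):

        import numpy as np

        # Score function: linear, a matrix multiply (assumed sizes: 10 classes, 3072 inputs)
        def score(W, x):
            return W @ x                              # (10, 3072) @ (3072,) -> (10,)

        # Activation / normalization to [0, 1]: softmax as one common choice
        def softmax(s):
            e = np.exp(s - s.max())                   # shift for numerical stability
            return e / e.sum()

        # Regularization: penalize complex W (L2 penalty shown)
        def l2_regularization(W, lam=1e-3):
            return lam * np.sum(W * W)

        W = 0.01 * np.random.randn(10, 3072)
        x = np.random.randn(3072)
        probs = softmax(score(W, x))                  # class "probabilities" in [0, 1]
        loss_reg = l2_regularization(W)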

  5. Neural Nets are Computational Graphs
      • Score, Activation and Regularization functions, together with a Loss function, form one computational graph
      • For backpropagation, we need a formula for the “gradient”, i.e. the derivative of each computational function
      [figure: the computational graph; the gradient at the loss output is 1.00]

  6. Training the model: backpropagation
      • Backpropagate the loss to the weights to be adjusted, proportional to a learning rate
      • For backpropagation, we need a formula for the “gradient”, i.e. the derivative of each computational function
      [figure: first backward step, through the 1/x gate: -1/(1.37)^2 * 1.00 = -0.53]

  7. Training the model: backpropagation (continued)
      [figure: backward step through the +1 gate: 1 * -0.53 = -0.53]

  8. Training the model: backpropagation (continued)
      [figure: backward step through the exp gate: e^(-1.00) * -0.53 = -0.20]

  9. Training the model: backpropagation (continued)
      [figure: full backward pass; gradients reach all inputs and weights (values include 0.20, 0.40, -0.40, -0.60, -0.20)]
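
      The numbers on slides 5-9 appear to come from the standard worked example f(w, x) = 1 / (1 + e^-(w0*x0 + w1*x1 + w2)); a NumPy sketch of that forward/backward pass, assuming the usual example values w0=2, x0=-1, w1=-3, x1=-2, w2=-3 (an assumption, chosen because it reproduces the figures above):

        import numpy as np

        w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

        # forward pass, one primitive gate per line
        s   = w0 * x0 + w1 * x1 + w2     # score          =  1.00
        e   = np.exp(-s)                 # exp gate       =  0.37
        den = 1.0 + e                    # +1 gate        =  1.37
        f   = 1.0 / den                  # 1/x gate       =  0.73

        # backward pass: chain rule, gate by gate
        dden = -1.0 / den**2 * 1.0       # -1/(1.37)^2 * 1.00 = -0.53
        de   = 1.0 * dden                # 1 * -0.53          = -0.53
        ds   = -np.exp(-s) * de          # gradient w.r.t. the score s =  0.20
        dw0, dx0 = x0 * ds, w0 * ds      # -0.20,  0.40
        dw1, dx1 = x1 * ds, w1 * ds      # -0.40, -0.60
        dw2 = 1.0 * ds                   #  0.20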

  10. Activation Functions
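
      The specific functions shown on the slide are not legible in this text dump; the usual cs231n candidates are sketched below for reference:

        import numpy as np

        def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)
        def tanh(x):       return np.tanh(x)                       # squashes to (-1, 1)
        def relu(x):       return np.maximum(0.0, x)               # max(0, x), the common default
        def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)  # avoids "dead" units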

  11. Get going quickly: Transfer Learning
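
      A hedged PyTorch sketch of the transfer-learning recipe (pretrained backbone, frozen weights, new final layer); the ResNet-18 backbone, the 10-class head and torchvision >= 0.13 are assumptions, not taken from the slide:

        import torch.nn as nn
        import torchvision.models as models

        model = models.resnet18(weights="DEFAULT")        # backbone pretrained on ImageNet
        for p in model.parameters():
            p.requires_grad = False                       # freeze all pretrained weights
        model.fc = nn.Linear(model.fc.in_features, 10)    # new head: 10 hypothetical classes
        # only model.fc.parameters() are then passed to the optimizer and trained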

  12. Neural Network Architecture
      • (mini) batch-wise training
      • matrix calculations galore
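
      A small NumPy illustration of "(mini) batch-wise training" and "matrix calculations galore" (all sizes are made up for the example):

        import numpy as np

        batch = np.random.randn(64, 784)                 # one mini-batch of 64 examples
        W1, b1 = 0.01 * np.random.randn(784, 100), np.zeros(100)
        W2, b2 = 0.01 * np.random.randn(100, 10), np.zeros(10)

        hidden = np.maximum(0.0, batch @ W1 + b1)        # one matrix multiply per layer (ReLU)
        scores = hidden @ W2 + b2                        # (64, 10) class scores for the whole batch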

  13. DEEP LEARNING SOFTWARE (credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung)

  14. Deep Learning Frameworks
      • Caffe (UC Berkeley) ➔ Caffe2 (Facebook); also Paddle (Baidu)
      • Torch (NYU/Facebook) ➔ PyTorch (Facebook); also CNTK (Microsoft)
      • Theano (Univ. Montreal) ➔ TensorFlow (Google); also MXNET (Amazon)
      All of these let you:
      • Easily build big computational graphs
      • Easily compute gradients in these graphs
      • Run it at high speed (e.g. on GPU)

  15. Deep Learning Frameworks
      • Plain (NumPy-style) code: you have to compute gradients by hand; no GPU support
      • TensorFlow: gradient computations are generated automagically from the forward phase (z=x*y; b=a+x; c=sum(b)); GPU support
      • PyTorch: similar to TensorFlow; not a "new language" but embedded in Python (control flow)
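
      The quoted forward phase can be run directly in PyTorch to see the "automagic" gradient generation (tensor shapes are an assumption; as quoted, z does not feed into c, so only x and a receive gradients):

        import torch

        x = torch.randn(3, requires_grad=True)
        y = torch.randn(3, requires_grad=True)
        a = torch.randn(3, requires_grad=True)

        z = x * y        # forward phase: plain tensor code
        b = a + x
        c = b.sum()

        c.backward()     # backward phase generated automatically from the forward phase
        print(x.grad)    # dc/dx = 1 for every element
        print(a.grad)    # dc/da = 1 for every element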

  16. TensorFlow: TensorBoard GUI

  17. Higher Levels of Abstraction: formulas are invoked "by name" (e.g. "sgd" = stochastic gradient descent)
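
      A sketch of what "by name" looks like in a high-level wrapper (Keras here, as one example; layer sizes and the loss choice are assumptions):

        import tensorflow as tf

        model = tf.keras.Sequential([
            tf.keras.Input(shape=(784,)),
            tf.keras.layers.Dense(100, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="sgd",                        # "sgd" = stochastic gradient descent
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(x_train, y_train, batch_size=64, epochs=5)  # training data not shown here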

  18. Static vs Dynamic Graphs

  19. Static vs Dynamic: optimization

  20. Static vs Dynamic: serialization (serialization = create a runnable program from the trained network)

  21. Static vs Dynamic: conditionals, loops
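
      With a dynamic graph, data-dependent conditionals and loops are just Python; a small PyTorch sketch (all names, sizes and the loop bound are made up):

        import torch

        x  = torch.randn(4, 3)
        w1 = torch.randn(3, 3, requires_grad=True)
        w2 = torch.randn(3, 3, requires_grad=True)

        h = x
        for _ in range(3):                                    # loop unrolled dynamically
            h = torch.relu(h @ (w1 if h.sum() > 0 else w2))   # data-dependent branch
        h.sum().backward()                                    # backprop through whichever path was taken

      In a static-graph framework the same data-dependent branch has to be expressed with dedicated graph operators (e.g. tf.cond).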

  22. What to Use?
      • TensorFlow is a safe bet for most projects. Not perfect, but it has a huge community and wide usage. Maybe pair it with a high-level wrapper (Keras, Sonnet, etc.)
      • PyTorch is best for research. However, it is still new; there can be rough patches.
      • Use TensorFlow for one graph over many machines
      • Consider Caffe, Caffe2, or TensorFlow for production deployment
      • Consider TensorFlow or Caffe2 for mobile

  23. DEEP LEARNING PERFORMANCE OPTIMIZATIONS (credits: cs231n.stanford.edu, Song Han)

  24. ML models are getting larger

  25. First Challenge: Model Size

  26. Second Challenge: Energy Efficiency

  27. Third Challenge: Training Speed

  28. Hardware Basics

  29. Special hardware? It’s in your pocket..
      • iPhone 8 with A11 chip: 6 CPU cores (2 powerful, 4 energy-efficient), an Apple GPU, and an Apple "TPU" (a deep-learning ASIC)
      • Only an on-chip FPGA is missing (will come in time..)

  30. Hardware Basics: Number Representation

  31. Hardware Basics: Number Representation

  32. Hardware Basics: Memory = Energy. Larger model ➔ more memory references ➔ more energy consumed

  33. Pruning Neural Networks

  34. Pruning Neural Networks • Learning both Weights and Connections for Efficient Neural Networks, Han, Pool, Tran, Dally, NIPS 2015
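
      A toy sketch of magnitude-based pruning in the spirit of that paper (the 90% sparsity target and the matrix size are assumptions; the paper also retrains the surviving weights, which is not shown here):

        import numpy as np

        W = np.random.randn(256, 256)                 # a hypothetical weight matrix
        threshold = np.quantile(np.abs(W), 0.90)      # keep only the largest 10% of weights
        mask = np.abs(W) >= threshold                 # connections that survive pruning
        W_pruned = W * mask                           # pruned (sparse) weight matrix
        print("sparsity:", 1.0 - mask.mean())         # ~0.90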

  35. Pruning Changes the Weight Distribution

  36. Pruning Happens in the Human Brain

  37. Trained Quantization

  38. Trained Quantization

  39. Trained Quantization: Before • Continuous weight distribution

  40. Trained Quantization: After • Discrete weight distribution
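
      A toy sketch of the weight-sharing idea behind trained quantization: cluster a layer's weights with k-means and store only a small codebook plus per-weight indices (k = 16, i.e. 4 bits, and the layer size are assumptions; the real method also fine-tunes the centroids during training):

        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(size=(64, 64)).ravel()

        k = 16
        centroids = np.linspace(W.min(), W.max(), k)          # initial codebook
        for _ in range(10):                                   # plain k-means iterations
            idx = np.abs(W[:, None] - centroids[None, :]).argmin(axis=1)
            for c in range(k):
                if np.any(idx == c):
                    centroids[c] = W[idx == c].mean()

        W_quantized = centroids[idx].reshape(64, 64)          # discrete weight distribution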

  41. Trained Quantization: How Many Bits? • Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Han, Mao, Dally, ICLR 2016

  42. Quantization to Fixed Point Decimals (=Ints)
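
      A minimal sketch of fixed-point (integer) quantization with one scale factor per tensor; the symmetric int8 scheme below is one common choice, not necessarily the exact scheme on the slide:

        import numpy as np

        W = np.random.randn(128, 128).astype(np.float32)

        scale = np.abs(W).max() / 127.0                 # value of 1 LSB in float units
        W_int8 = np.round(W / scale).astype(np.int8)    # stored / computed as 8-bit integers
        W_back = W_int8.astype(np.float32) * scale      # dequantized approximation
        print("max abs error:", np.abs(W - W_back).max(), "<= scale/2 =", scale / 2)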

  43. Hardware Basics: Number Representation

  44. Mixed Precision Training

  45. Mixed Precision Training
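
      A hedged sketch of mixed precision training, using PyTorch AMP as one concrete implementation (requires a CUDA GPU; the model, data and sizes are placeholders): the forward/backward math runs mostly in FP16, the weights are kept in FP32, and the loss is scaled to avoid FP16 underflow.

        import torch
        from torch.cuda.amp import autocast, GradScaler

        model = torch.nn.Linear(784, 10).cuda()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        scaler = GradScaler()

        data   = torch.randn(64, 784, device="cuda")
        target = torch.randint(0, 10, (64,), device="cuda")

        optimizer.zero_grad()
        with autocast():                                   # ops run in FP16 where safe
            loss = torch.nn.functional.cross_entropy(model(data), target)
        scaler.scale(loss).backward()                      # scaled loss -> scaled FP16 gradients
        scaler.step(optimizer)                             # unscales gradients, updates FP32 weights
        scaler.update()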

  46. DEEP LEARNING HARDWARE (credits: cs231n.stanford.edu, Song Han)

  47. The end of CPU scaling

  48. CPUs for Training - SIMD to the rescue?

  49. CPUs for Training - SIMD to the rescue?
      [figure: 4 scalar instructions vs. 1 SIMD instruction]
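
      From Python, the practical route to those SIMD units is a vectorized library call rather than an element-by-element loop; an illustrative NumPy comparison (NumPy's compiled kernels use the CPU's SIMD instructions internally):

        import numpy as np

        a = np.random.rand(100_000).astype(np.float32)
        b = np.random.rand(100_000).astype(np.float32)

        c_scalar = np.empty_like(a)
        for i in range(a.size):            # scalar style: one multiply per iteration
            c_scalar[i] = a[i] * b[i]

        c_vector = a * b                   # one vectorized call; SIMD under the hood
        assert np.allclose(c_scalar, c_vector)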

  50. CPU vs GPU
      “ALU”: arithmetic logic unit (implements +, *, - etc. instructions)
      • CPU: a lot of chip surface for cache memory and control
      • GPU: almost all chip surface for ALUs (compute power)
      • GPU cards have their own memory chips: smaller, but nearby and faster than system memory

  51. Programming GPUs
      • CUDA (NVIDIA only): write C-like code that runs directly on the GPU; higher-level APIs: cuBLAS, cuFFT, cuDNN, etc.
      • OpenCL: similar to CUDA, but runs on anything; usually slower :(
      • All major deep learning libraries (TensorFlow, PyTorch, MXNET, etc.) support training and model evaluation on GPUs.
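
      From the framework side this is mostly transparent; a PyTorch sketch of placing work on the GPU (shown as one example; sizes are made up):

        import torch

        device = "cuda" if torch.cuda.is_available() else "cpu"

        model = torch.nn.Linear(1024, 1024).to(device)   # move the weights to the GPU
        x = torch.randn(64, 1024, device=device)         # allocate the batch on the GPU
        y = model(x)                                     # on CUDA devices this dispatches to cuBLAS kernels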

  52. CPU vs GPU: performance
