Getting Started with TensorFlow on GPUs
Magnus Hyttsten (@MagnusHyttsten)


  1. Getting Started with TensorFlow on GPUs. Magnus Hyttsten @MagnusHyttsten

  2. Agenda

  3. An Awkward Social Experiment (that I'm afraid you will be part of...)

  4. ROCKS!

  5. "GTC" Input Data Examples (Train & Test Data) Model <Awkward Output (Your Silence> Brain)

  6. "GTC" Labels Input Data (Correct Answers) Examples (Train & Test Data) "Rocks" Model Output Loss (Your function Brain) Optimizer

  7. "GTC" Labels Input Data (Correct Answers) Examples (Train & Test Data) "Rocks" "Rocks" Model Output Loss (Your function Brain) Optimizer

  8. "Classical" Programming: Input Data + Code → Output Data. Machine Learning: Input Data + Output Data → Code

  9. TensorFlow 2.0 Alpha is out. Easy: simplified APIs, focused on Keras and eager execution. Powerful: flexibility and performance, power to do cutting-edge research and scale to > 1 exaflops. Scalable: tested at Google-scale, deploy everywhere.

  10. tf.data (Dataset) → tf.feature_column (Transfer Learning) → High-level APIs → Perform Distributed Training (talk @1pm), e.g. on a V100

  11. tf.data (Dataset) → tf.feature_column (Transfer Learning) → High-level APIs → Perform Distributed Training (talk @1pm), e.g. on a V100
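
A minimal sketch of how the first two pieces of that stack fit together: a tf.data pipeline over raw examples, plus feature columns declaring how each raw feature feeds a model. The feature names and toy values here are made up for illustration:

      import tensorflow as tf

      # tf.data: an input pipeline over toy in-memory data
      features = {"age": [23.0, 35.0, 52.0], "visits": [1.0, 4.0, 2.0]}
      labels = [0, 1, 1]
      dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
                 .shuffle(3).batch(2))

      # tf.feature_column: declare how each raw feature feeds a model
      columns = [tf.feature_column.numeric_column("age"),
                 tf.feature_column.numeric_column("visits")]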

  12. Built to Distribute and Scale: Premade Estimators (LinearRegressor, LinearClassifier, DNNRegressor, DNNClassifier, DNNLinearCombinedRegressor, DNNLinearCombinedClassifier, BoostedTreeRegressor, BoostedTreeClassifier, BaselineRegressor, BaselineClassifier), each of which calls input_fn (Datasets, tf.data)

  13. Premade Estimators

      # Pick one of the premade Estimators:
      # LinearRegressor(...), LinearClassifier(...),
      # DNNRegressor(...), DNNClassifier(...),
      # DNNLinearCombinedRegressor(...), DNNLinearCombinedClassifier(...),
      # BaselineRegressor(...), BaselineClassifier(...),
      # BoostedTreeRegressor(...), BoostedTreeClassifier(...)
      estimator = LinearRegressor(...)

      # Train locally; each input_fn feeds Datasets (tf.data)
      estimator.train(input_fn=..., ...)
      estimator.evaluate(input_fn=..., ...)
      estimator.predict(input_fn=..., ...)
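
A minimal runnable version of that pattern, with toy data (the feature name "x" and the y = 2x values are made up for illustration):

      import tensorflow as tf

      feature_columns = [tf.feature_column.numeric_column("x")]

      def input_fn():
          # Toy regression data (y = 2x), batched and repeated for training
          ds = tf.data.Dataset.from_tensor_slices(
              ({"x": [1.0, 2.0, 3.0, 4.0]}, [2.0, 4.0, 6.0, 8.0]))
          return ds.batch(2).repeat()

      estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
      estimator.train(input_fn=input_fn, steps=100)
      print(estimator.evaluate(input_fn=input_fn, steps=10)["average_loss"])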

  14. Premade Estimator - Wide & Deep

      wide_columns = [
          tf.feature_column.bucketized_column(
              tf.feature_column.numeric_column('age'),
              boundaries=[18, 27, 40, 65])]
      deep_columns = [
          tf.feature_column.numeric_column('visits'),
          tf.feature_column.numeric_column('clicks')]
      tf.estimator.DNNLinearCombinedClassifier(
          linear_feature_columns=wide_columns,
          dnn_feature_columns=deep_columns,
          dnn_hidden_units=[100, 75, 50, 25])
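
Training this model follows the same input_fn pattern as the other premade Estimators; a sketch with made-up in-memory data matching the three columns above:

      def input_fn():
          features = {"age": [22.0, 45.0, 63.0],
                      "visits": [3.0, 10.0, 1.0],
                      "clicks": [1.0, 7.0, 0.0]}
          labels = [0, 1, 0]  # hypothetical binary targets
          return (tf.data.Dataset.from_tensor_slices((features, labels))
                  .batch(2).repeat())

      estimator = tf.estimator.DNNLinearCombinedClassifier(
          linear_feature_columns=wide_columns,
          dnn_feature_columns=deep_columns,
          dnn_hidden_units=[100, 75, 50, 25])
      estimator.train(input_fn=input_fn, steps=100)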

  15. tf.data (Dataset) → tf.feature_column (Transfer Learning) → Perform Distributed Training, e.g. on a V100

  16. Custom Models: tf.keras / tf.keras.layers

      model = tf.keras.models.Sequential([
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(512, activation='relu'),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(10, activation='softmax')
      ])
      model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])

      # Datasets feed fit/evaluate/predict
      model.fit(dataset, epochs=5)
      model.evaluate(dataset)
      model.predict(dataset)
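
The `dataset` variable isn't defined on the slide; one way to build it that matches this model's shapes (28x28 inputs, 10 classes), assuming MNIST from tf.keras.datasets:

      import tensorflow as tf

      (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
      x_train = x_train / 255.0  # scale pixels to [0, 1]

      dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(10000)
                 .batch(32))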

  17. TensorFlow Datasets

      import tensorflow_datasets as tfds
      train_ds = tfds.load("imdb_reviews",
                           split="train",
                           as_supervised=True)

      ● 30+ available, and you can add your own:
        ○ audio: "nsynth"
        ○ image: "celeb_a", "cifar10", "coco2014", "diabetic_retinopathy_detection", "imagenet2012", "mnist", "open_images_v4"
        ○ structured: "titanic"
        ○ text: "imdb_reviews", "lm1b", "squad"
        ○ translate: "wmt_translate_ende", "wmt_translate_enfr"
        ○ video: "bair_robot_pushing_small", "moving_mnist", "starcraft_video"
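
After tfds.load, a typical input pipeline shuffles and batches before feeding a model; a sketch using "mnist" (chosen here because its fixed-size images batch directly, unlike variable-length text):

      import tensorflow_datasets as tfds

      train_ds = tfds.load("mnist", split="train", as_supervised=True)
      # as_supervised=True yields (image, label) tuples
      train_ds = train_ds.shuffle(10000).batch(32).prefetch(1)

      for images, labels in train_ds.take(1):
          print(images.shape, labels.shape)  # (32, 28, 28, 1) (32,)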

  18. TensorFlow Summary
      ● Datasets (tf.data) for the input pipeline
        a. TensorFlow Datasets is great
        b. tf.feature_columns are cool too
      ● Premade Estimators
      ● Keras Models (tf.keras)

  19. The V100: and why is it so good @ Machine Learning???

  20. Disclaimer
      ● High-Level: we look at only parts of the power of GPUs
      ● Simple Overview: more optimal designs exist
      ● Reduced Scope: only considering fully-connected layers, etc.

  21. Strengths of V100
      ● Built for Massively Parallel Computations
      ● Specific hardware / software to manage Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc.)

  22. Strengths of V100
      ● Built for Massively Parallel Computations
      ● Specific hardware / software to manage Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc.)
      Tesla SXM V100: 5376 cores (FP32)
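
Those Tensor Cores are exercised through mixed-precision execution. As a sketch only: this is how a later tf.keras API (tf.keras.mixed_precision, not part of the 2.0 alpha shown in this talk) enables it:

      import tensorflow as tf

      # Compute in float16 (Tensor Cores), keep variables in float32
      tf.keras.mixed_precision.set_global_policy("mixed_float16")

      model = tf.keras.Sequential([
          tf.keras.layers.Dense(512, activation="relu"),
          # Keep the final softmax in float32 for numeric stability
          tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
      ])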

  23. My Questions Around the GPU: What are we going to do with 5376 FP32 cores?

  24. The Unsatisfactory Answer: What are we going to do with 5376 FP32 cores? "Execute things in parallel"!

  25. What are we going to do with 5376 FP32 cores? "Execute things in parallel"! Yes, but how exactly can we do that for ML Workloads?

  26. What are we going to do with 5376 FP32 cores? "Execute things in parallel"! Yes, but how exactly can we do that for ML Workloads? "Hey, that's your job - That's why we're here listening"!

  27. What are we going to do with 5376 FP32 cores? "Execute things in parallel"! Yes, but how exactly can we do that for ML Workloads? "Hey, that's your job - That's why we're here listening"! Alright, let me try to talk about that then.

  28. ● We may have a huge number of layers
      ● Each layer can have a huge number of neurons
      → There may be hundreds of millions or even billions of * and + ops.
      All the knobs are W values that we need to tune, so that given a certain input, they generate the correct output.
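
A quick back-of-the-envelope check of that claim: a fully-connected layer with n inputs and m outputs costs n*m multiplies (and about as many adds) per example, so a few wide layers add up fast. A sketch with made-up layer widths:

      # Hypothetical fully-connected network: multiply ops per forward pass
      input_size = 4096
      layer_sizes = [4096, 4096, 4096, 1000]  # made-up layer widths

      total_mults = 0
      prev = input_size
      for n in layer_sizes:
          total_mults += prev * n  # one multiply (and one add) per weight
          prev = n
      print(total_mults)  # 54,427,648 per example; a batch of 1000 -> ~54 billion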

  29. "Matrix Multiplication is EATING (the computing resources of) THE WORLD" h i_j = [X 0 , X 1 , X 2, ... ] * [W 0 , W 1 , W 2, ... ] h i_j = X 0 *W 0 + X 1 *W 1 + X 2 *W 2 + ...

  30. Matmul

      X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
      W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
      h_00 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
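
Filling in the elided values makes this runnable in plain Python and confirms the sum:

      X = [float(i) for i in range(1, 257)]  # [1.0, 2.0, ..., 256.0]
      W = [0.1] * 256                        # 256 weights of 0.1
      h = sum(x * w for x, w in zip(X, W))
      print(h)  # ~= 3289.6  (0.1 * (1 + 2 + ... + 256) = 0.1 * 32896)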

  31. Single-threaded Execution

  32. Single-threaded Execution

      X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
      W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
      h_00 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

  33. The single thread computes the first product: 1*0.1 = 0.1

  34. That running total is kept ("Prev" = 0.1)

  35. The next product is added to it: 0.1 + 2*0.1 = 0.3

  36. ...and so on, one multiply-add at a time, until: 3238.5 + 255*0.1 = 3264.0, then 3264.0 + 256*0.1 = 3289.6

  37. Single-threaded execution runs all 256 multiply-add steps in sequence, so the total time is 256 * t (where t is the time for one step)
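
Written out as the sequential loop those slides describe (a self-contained sketch; the intermediate values match the slides):

      X = [float(i) for i in range(1, 257)]
      W = [0.1] * 256

      acc = 0.0                  # the "Prev" running total from the slides
      for x, w in zip(X, W):     # 256 iterations, strictly one after another
          acc += x * w           # ...; 3238.5 + 25.5 = 3264.0; 3264.0 + 25.6 = 3289.6
      print(acc)                 # ~= 3289.6, reached only after 256 steps (256 * t)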

  38. GPU Execution

  39. GPU - #1 Multiplication Step

      X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
      W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
      h_00 = X * W                # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

  40. A Tesla SXM V100 has 5376 FP32 cores, more than enough to give each of the 256 products X_i * W_i its own core

  41. So in the multiplication step, all 256 multiplies can run at the same time
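
That elementwise step is easy to express with TensorFlow, which places it on the GPU when one is available; a minimal sketch (the reduction across products is the separate addition step that follows):

      import tensorflow as tf

      X = tf.range(1.0, 257.0)     # [1.0, 2.0, ..., 256.0]
      W = tf.fill([256], 0.1)      # 256 weights of 0.1
      products = X * W             # step #1: 256 independent multiplies, one per core
      h = tf.reduce_sum(products)  # the 256 additions still need to be combined
      print(float(h))              # ~= 3289.6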
