@MagnusHyttsten
Meet Robin (a Guinea Pig)
An Awkward Social Experiment (that I'm afraid you need to be part of...)
ROCKS!
"GTC" Input Data Examples (Train & Test Data) Model <Awkward Output (Your Silence> Brain)
"GTC" Labels Input Data (Correct Answers) Examples (Train & Test Data) "Rocks" Model Output Loss (Your function Brain) Optimizer
"GTC" Labels Input Data (Correct Answers) Examples (Train & Test Data) "Rocks" "Rocks" Model Output Loss (Your function Brain) Optimizer
Agenda
● Intro to Machine Learning
● Creating a TensorFlow Model
● Why are GPUs Great for Machine Learning Workloads
● Distributed TensorFlow Training
Premade Estimators | Estimator | Datasets | tf.keras | tf.keras.layers
Python Frontend | Java | C++
TensorFlow Distributed Execution Engine
CPU | GPU | Android | iOS | ...
TensorFlow Estimator Architecture: the Estimator (tf.estimator) calls the input_fn (Datasets, tf.data)
Premade Estimators
The Estimator (tf.estimator) calls the input_fn (Datasets, tf.data). The premade Estimators all subclass Estimator:
LinearClassifier, LinearRegressor, DNNClassifier, DNNRegressor, DNNLinearCombinedClassifier, DNNLinearCombinedRegressor, BaselineClassifier, BaselineRegressor
Premade Estimators

    # Pick a premade Estimator
    estimator = LinearRegressor(...)   # or LinearClassifier(...), DNNRegressor(...),
                                       # DNNClassifier(...), DNNLinearCombinedRegressor(...),
                                       # DNNLinearCombinedClassifier(...),
                                       # BaselineRegressor(...), BaselineClassifier(...)

    # Train locally - each input_fn is backed by Datasets (tf.data)
    estimator.train(input_fn=..., ...)
    estimator.evaluate(input_fn=..., ...)
    estimator.predict(input_fn=..., ...)
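As a concrete sketch of the premade-Estimator workflow above. This assumes the TensorFlow 1.x-style tf.estimator API; the feature name "x" and the toy data are made up for illustration:

```python
import tensorflow as tf

# One hypothetical numeric feature named "x"
feature_x = tf.feature_column.numeric_column("x")
estimator = tf.estimator.LinearRegressor(feature_columns=[feature_x])

def input_fn():
    # Datasets (tf.data) feed the Estimator: a features dict plus labels
    ds = tf.data.Dataset.from_tensor_slices(
        ({"x": [1.0, 2.0, 3.0, 4.0]}, [2.0, 4.0, 6.0, 8.0]))
    return ds.repeat(100).batch(4)

# Train locally, then evaluate and predict with the same input pipeline
estimator.train(input_fn=input_fn)
metrics = estimator.evaluate(input_fn=input_fn)
```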
Custom Models #1 - model_fn
The Estimator (tf.estimator) calls your model_fn as well as the input_fn (Datasets, tf.data); the model_fn builds the model using Keras Layers (tf.keras.layers). The premade Estimators (LinearClassifier, LinearRegressor, DNNClassifier, DNNRegressor, DNNLinearCombinedClassifier, DNNLinearCombinedRegressor, BaselineClassifier, BaselineRegressor) subclass Estimator in the same architecture.
Custom Models #2 - Keras Model
A Keras model (tf.keras), built from Keras Layers (tf.keras.layers), is converted with model_to_estimator into an Estimator (tf.estimator); the Estimator then calls the input_fn (Datasets, tf.data) just as before.
Custom Models - tf.keras / tf.keras.layers

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                     input_shape=(28, 28, 1)))  # input_shape assumed (e.g. MNIST)
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
Train/Evaluate Model

    # Convert a Keras model to a tf.estimator.Estimator
    estimator = tf.keras.estimator.model_to_estimator(keras_model=model, ...)

    # Train locally - each input_fn is backed by Datasets (tf.data)
    estimator.train(input_fn=..., ...)
    estimator.evaluate(input_fn=..., ...)
    estimator.predict(input_fn=..., ...)
Summary - Use Estimators, Datasets, and Keras
● Premade Estimators (tf.estimator): use them when possible
● Custom Models:
  a. model_fn in Estimator, built with tf.keras.layers
  b. Keras Models (tf.keras), converted with estimator = tf.keras.estimator.model_to_estimator(...)
● Datasets (tf.data) for the input pipeline
Agenda
● Intro to Machine Learning
● Creating a TensorFlow Model
● Why are GPUs Great for Machine Learning Workloads
● Distributed TensorFlow Training
Disclaimer...
● High-Level - We look at only parts of the power of GPUs
● Simple Overview - More optimal designs exist
● Reduced Scope - Only considering fully-connected layers, etc.
Strengths of V100 GPU
● Built for Massively Parallel Computations
● Specific hardware & software to manage Deep Learning Workloads (Tensor Cores, mixed-precision execution, etc)

Tesla SXM V100
● 5376 cores (FP32)
Strengths of V100 GPU
What are we going to do with 5376 FP32 cores?
"Execute things in parallel!"
Yes, but how exactly can we do that for ML workloads?
"Hey, that's your job - that's why we're here listening!"
Alright, let's talk about that then.
● We may have a huge number of layers
● Each layer can have a huge number of neurons
→ There may be hundreds of millions, or even billions, of * and + ops
All the knobs are W values that we need to tune, so that given a certain input they generate the correct output.
"Matrix Multiplication is EATING (the computing resources of) THE WORLD" h i_j = [X 0 , X 1 , X 2, ... ] * [W 0 , W 1 , W 2, ... ] h i_j = X 0 *W 0 + X 1 *W 1 + X 2 *W 2 + ...
Matmul

    X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
    W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
    h_0,0 = X · W               # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6
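The dot product on this slide can be checked directly. A plain-Python stand-in for the matmul (not TensorFlow), using the same 256 inputs and weights:

```python
X = [float(i) for i in range(1, 257)]  # 256 input values: 1.0 .. 256.0
W = [0.1] * 256                        # 256 weight values

# h_0,0 = X . W = 1*0.1 + 2*0.1 + ... + 256*0.1
h = sum(x * w for x, w in zip(X, W))
print(round(h, 1))  # -> 3289.6, since 0.1 * (256*257/2) = 0.1 * 32896
```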
Single-threaded Execution

    X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
    W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
    h_0,0 = X · W               # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

A single core walks X and W element by element, keeping a running sum (Prev):

              1*0.1   = 0.1
    0.1     + 2*0.1   = 0.3
    ...
    3238.5  + 255*0.1 = 3264.0
    3264.0  + 256*0.1 = 3289.6

→ 256 sequential multiply-accumulate steps: 256 * t
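The running-sum column above can be sketched as a sequential multiply-accumulate loop - one core, one element per step:

```python
X = [float(i) for i in range(1, 257)]  # 256 input values: 1.0 .. 256.0
W = [0.1] * 256                        # 256 weight values

prev = 0.0
for x, w in zip(X, W):
    prev = prev + x * w  # one multiply-accumulate per step:
                         # 0.1, 0.3, ..., 3264.0, 3289.6

print(round(prev, 1))    # -> 3289.6 after 256 sequential steps
```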
GPU Execution
GPU - #1 Multiplication Step

    X = [1.0, 2.0, ..., 256.0]  # Let's say we have 256 input values
    W = [0.1, 0.1, ..., 0.1]    # Then we need to have 256 weight values
    h_0,0 = X · W               # 1*0.1 + 2*0.1 + ... + 256*0.1 == 3289.6

With a Tesla SXM V100 (5376 FP32 cores), every product X_i * W_i can get its own core, so all 256 multiplications happen in a single parallel step.
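A sketch of how the GPU spends its cores on this: all 256 products are independent, so with one core per element they take a single parallel step, and the additions then collapse in a log2(256) = 8-step pairwise tree reduction. A plain-Python simulation of that schedule (not real GPU code):

```python
X = [float(i) for i in range(1, 257)]  # 256 input values: 1.0 .. 256.0
W = [0.1] * 256                        # 256 weight values

# Step 1: elementwise products - 256 independent ops, one parallel step
products = [x * w for x, w in zip(X, W)]

# Steps 2..9: pairwise tree reduction, halving the vector each step
steps = 0
while len(products) > 1:
    products = [products[i] + products[i + 1]
                for i in range(0, len(products), 2)]
    steps += 1

print(round(products[0], 1), steps)  # -> 3289.6 in 8 reduction steps
                                     #    (vs 256 sequential steps on one core)
```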