Mocha.jl: Deep Learning in Julia
Chiyuan Zhang (@pluskid), CSAIL, MIT
Deep Learning
• Learning with multi-layer (3~30) neural networks, on a huge training set.
• State-of-the-art on many AI tasks:
  • Computer Vision: image classification, object detection, semantic segmentation, etc.
  • Speech Recognition & Natural Language Processing: acoustic modeling, language modeling, word / sentence embeddings
GoogLeNet (Inception): winner of ILSVRC 2014, 27 layers, ~7 million parameters.
[Architecture diagram: stacked Inception modules, each concatenating 1x1, 3x3, and 5x5 convolution branches and a 3x3 max-pooling branch (DepthConcat), with local response normalization near the input, average pooling + fully-connected + softmax at the top, and two auxiliary softmax classifiers partway through the network.]
ILSVRC on ImageNet
[Chart: top-5 error (%, 0-30) by year, 2010-2014, with "Human" and "arXiv 2015" reference points.]
Deep learning has dominated since 2012, surpassing "human performance" since 2015.
Deep Learning in Speech Recognition
Image source: Li Deng and Dong Yu, Deep Learning: Methods and Applications.
Deep Neural Networks • A network that consists of computation units (layers, or nodes) connected via a specific architecture. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS 2012.
Deep Learning Made Easy
A deep learning toolkit provides common layers, easy ways to define network architectures, and a transparent interface to high-performance computation backends (BLAS, GPUs, etc.)
• C++: Caffe (widely used in academia), dmlc/cxxnet, cuda-convnet, etc.
• Python: Theano (auto-differentiation) and its wrappers, NervanaSystems/neon, etc.
• Lua: Torch7 (Facebook); Matlab: MatConvNet (VGG)
• Julia: pluskid/Mocha.jl, dfdx/Boltzmann.jl
Why Mocha.jl?
• Written in Julia and for Julia: easily make use of data pre/post-processing and visualization tools from Julia.
• Minimal dependencies: the Julia backend is ready to run out of the box, good for fast prototyping.
• Multiple backends: easily switch to the CUDA + cuDNN based backend for highly efficient deep net training.
• Correctness: all computation layers are unit-tested.
• Modular architecture: layers, activation functions, network topology, etc. are easily extendable.
Mocha.jl
• Deep learning framework for (and written in) Julia; inspired by Caffe; focusing on easy prototyping, customization, and efficiency (switchable computation backends)
> Pkg.add("Mocha")
or for the latest dev version:
> Pkg.checkout("Mocha")
> Pkg.test("Mocha")
IJulia Example
Image classification example in IJulia with a pre-trained ImageNet model.
Mini-tutorial: CNN on MNIST
• MNIST: handwritten digits
• Data preparation:
  • Image data in Mocha is represented as a 4D tensor: width-by-height-by-channels-by-batch
  • MNIST: 28-by-28-by-1-by-64
  • Mocha supports ND-tensors for general data
  • HDF5 file: a general format for tensor data, also supported by numpy, Matlab, etc.
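A minimal sketch of preparing such an HDF5 file with HDF5.jl; the dataset names "data" and "label" follow Mocha's HDF5 data layer convention, while the filename and the train_x / train_y variables are illustrative assumptions:

using HDF5
# train_x: Float32 array, 28 x 28 x 1 x N (width-by-height-by-channels-by-number of images)
# train_y: Float32 array of digit labels (e.g. 1 x N)
h5open("data/train.hdf5", "w") do h5
  write(h5, "data", train_x)
  write(h5, "label", train_y)
end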
Mini-tutorial: CNN on MNIST
• Data layer
data_layer = AsyncHDF5DataLayer(name="train-data", source="data/train.txt", batch_size=64, shuffle=true)
• data/train.txt lists the HDF5 files for the training set
• 64 images are provided in each mini-batch
• the data is shuffled to improve convergence
• the async data layer uses Julia's @async to pre-read data while computation runs on the CPU / GPU
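The source file is just a plain-text list of HDF5 filenames, one per line; the path below is an illustrative assumption matching the data preparation sketch above:

# contents of data/train.txt
data/train.hdf5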
Convolution layer
[Figure: LeNet-5 architecture; 32x32 input, alternating convolution and subsampling (pooling) feature-map stages, followed by fully-connected layers and a 10-way output.]
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
conv_layer = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5), bottoms=[:data], tops=[:conv])
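As a quick shape sanity check (a worked example, not from the slides): a valid 5x5 convolution over a 28x28 MNIST image produces 28 - 5 + 1 = 24 outputs per dimension, so with 20 filters and a batch of 64 the :conv blob is 24 x 24 x 20 x 64.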
Pooling Layer
pool_layer = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2), bottoms=[:conv], tops=[:pool])
• The pooling layer operates on the output of the convolution layer
• By default, MAX pooling is performed; switch to MEAN pooling by specifying pooling=Pooling.Mean()
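Continuing the shape bookkeeping from the convolution layer: 2x2 pooling with stride 2 halves each spatial dimension, so the :pool blob is 12 x 12 x 20 x 64.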
Blobs & Net Architecture
• Network architecture is determined by connecting tops (output) blobs to bottoms (input) blobs with matching blob names.
• Layers are automatically sorted and connected as a directed acyclic graph (DAG).
Rest of the layers
conv2_layer = ConvolutionLayer(name="conv2", n_filter=50, kernel=(5,5), bottoms=[:pool], tops=[:conv2])
pool2_layer = PoolingLayer(name="pool2", kernel=(2,2), stride=(2,2), bottoms=[:conv2], tops=[:pool2])
fc1_layer = InnerProductLayer(name="ip1", output_dim=500, neuron=Neurons.ReLU(), bottoms=[:pool2], tops=[:ip1])
fc2_layer = InnerProductLayer(name="ip2", output_dim=10, bottoms=[:ip1], tops=[:ip2])
loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2, :label])
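With all layers defined, the network is assembled by matching tops/bottoms blob names. A minimal sketch, assuming a backend has already been created and initialized (see the GPU vs CPU demo later); the net name and the common_layers grouping are illustrative:

common_layers = [conv_layer, pool_layer, conv2_layer, pool2_layer, fc1_layer, fc2_layer]
net = Net("MNIST-train", backend, [data_layer, common_layers..., loss_layer])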
SGD Solver
params = SolverParameters(max_iter=10000, regu_coef=0.0005,
    mom_policy=MomPolicy.Fixed(0.9),
    lr_policy=LRPolicy.Inv(0.01, 0.0001, 0.75),
    load_from=exp_dir)
solver = SGD(params)
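For reference (assuming Mocha mirrors the Caffe-style "inv" schedule, so treat the exact formula as an assumption): LRPolicy.Inv(base_lr, gamma, power) decays the learning rate as base_lr * (1 + gamma * iter)^(-power), MomPolicy.Fixed keeps the momentum constant at 0.9, and load_from names the directory (exp_dir) from which an existing snapshot is restored when training resumes.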
Coffee Breaks… for the solver
setup_coffee_lounge(solver, save_into="$exp_dir/statistics.jld", every_n_iter=1000)
# report training progress every 100 iterations
add_coffee_break(solver, TrainingSummary(), every_n_iter=100)
# save snapshots every 5000 iterations
add_coffee_break(solver, Snapshot(exp_dir), every_n_iter=5000)
Solver Statistics
Solver statistics are saved automatically if the coffee lounge is set up. Snapshots save the training progress periodically, so training can continue from the last snapshot after an interruption.
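Once the net, solver, and coffee breaks are in place, training is launched with solve. A minimal sketch, using the net assembled earlier (resuming from a snapshot happens automatically when exp_dir already contains one, via load_from):

solve(solver, net)   # run up to max_iter, saving statistics and snapshots along the way
destroy(net)         # release the network's resources when done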
Demo: GPU vs CPU backend = use_gpu ? GPUBackend() : CPUBackend()
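A sketch of the surrounding backend lifecycle; the MOCHA_USE_CUDA flag follows Mocha's documented convention for enabling the CUDA backend, and the exact setup here is an assumption:

ENV["MOCHA_USE_CUDA"] = "true"   # set before `using Mocha` when a CUDA device is available
using Mocha

backend = use_gpu ? GPUBackend() : CPUBackend()
init(backend)        # initialize the backend before constructing any Net
# ... build nets, run the solver ...
shutdown(backend)    # release backend resources at the end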
Parameter Sharing
• When a layer has trainable parameters (e.g. convolution or inner-product layers), those parameters are registered under the layer name and shared by layers with the same name (see the sketch below)
• Use cases
  • Validation network during training
  • Pre-training, fine-tuning
  • Advanced architectures, time-delayed nodes
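A sketch of the first use case, continuing the MNIST example: the test network reuses the same layer objects (and hence the same layer names), so it shares the trained parameters. The accuracy layer, batch size, and source file here are illustrative assumptions:

data_layer_test = HDF5DataLayer(name="test-data", source="data/test.txt", batch_size=100)
acc_layer = AccuracyLayer(name="test-accuracy", bottoms=[:ip2, :label])
test_net = Net("MNIST-test", backend, [data_layer_test, common_layers..., acc_layer])
# evaluate on the validation net every 1000 iterations during training
add_coffee_break(solver, ValidationPerformance(test_net), every_n_iter=1000)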
3rd most-starred Julia package. Contributions are very welcome!