

  1. Mocha.jl Deep Learning in Julia Chiyuan Zhang (@pluskid) CSAIL, MIT

  2. Deep Learning • Learning with multi-layer (3~30 layers) neural networks on huge training sets. • State-of-the-art on many AI tasks • Computer Vision: Image Classification, Object Detection, Semantic Segmentation, etc. • Speech Recognition & Natural Language Processing: Acoustic Modeling, Language Modeling, Word / Sentence Embedding

  3. GoogLeNet (Inception) Winner of ILSVRC 2014, 27 layers, ~7 million parameters [Architecture diagram: a deep stack of Inception modules, each combining 1x1, 3x3, and 5x5 convolutions with 3x3 max pooling and joining them via DepthConcat; two auxiliary classifiers (softmax0, softmax1) branch off mid-network, and the main output passes through AveragePool 7x7, FC, and softmax2.]

  4. ILSVRC on ImageNet [Chart: Top-5 error by year, 2010-2014, plus arXiv 2015 results and a human-performance baseline.] Deep learning has dominated since 2012, surpassing “human performance” since 2015.

  5. Deep Learning in Speech Recognition [Image source: Li Deng and Dong Yu. Deep Learning: Methods and Applications.]

  6. Deep Neural Networks • A network that consists of computation units (layers or nodes) connected via a specific architecture. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS 2012.

  7. Deep Learning Made Easy A deep learning toolkit provides common layers, easy ways to define network architectures, and a transparent interface to high-performance computation backends (BLAS, GPUs, etc.) • C++: Caffe (widely used in academia), dmlc/cxxnet, cuda-convnet, etc. • Python: Theano (auto-differentiation) and its wrappers, NervanaSystems/neon, etc. • Lua: Torch7 (Facebook); Matlab: MatConvNet (VGG) • Julia: pluskid/Mocha.jl, dfdx/Boltzmann.jl

  8. Why Mocha.jl? • Written in Julia and for Julia: easily make use of data pre-/post-processing and visualization tools from Julia. • Minimal dependencies: the Julia backend is ready to run, easy for fast prototyping. • Multiple backends: easily switch to the CUDA + cuDNN based backend for highly efficient deep net training. • Correctness: all computation layers are unit-tested. • Modular architecture: layers, activation functions, network topology, etc. Easily extendable.

  9. Mocha.jl • Deep learning framework for (and written in) Julia; inspired by Caffe; focusing on easy prototyping, customization, and efficiency (switchable computation backends) > Pkg.add("Mocha") or for the latest dev version: > Pkg.checkout("Mocha") > Pkg.test("Mocha")

  10. IJulia Example Image classification example (IJulia notebook) with a pre-trained ImageNet model.

  11. Mini-tutor: CNN on MNIST • MNIST: handwritten digits • Data preparation: • Image data in Mocha is represented as a 4D tensor: width-by-height-by-channels-by-batch • MNIST: 28-by-28-by-1-by-64 • Mocha supports ND tensors for general data • HDF5 file: general format for tensor data, also supported by numpy, Matlab, etc.
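The data-preparation code itself is not shown in the slides. Below is a minimal sketch of writing such an HDF5 file with HDF5.jl; the file name, the random stand-in arrays, and the "data"/"label" dataset names (the convention Mocha's HDF5 data layers expect) are illustrative assumptions rather than code from the talk.

    using HDF5

    n_samples = 1000                                           # illustrative subset size
    data  = rand(Float32, 28, 28, 1, n_samples)                # width-by-height-by-channels-by-batch
    label = convert(Array{Float32}, rand(0:9, 1, n_samples))   # one digit label per image

    h5open("data/train.hdf5", "w") do h5
      write(h5, "data",  data)    # dataset names the HDF5 data layer reads
      write(h5, "label", label)
    end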

  12. Mini-tutor: CNN on MNIST • Data layer data_layer = AsyncHDF5DataLayer(name="train-data", source="data/train.txt", batch_size=64, shuffle=true) • data/train.txt lists the HDF5 files for the training set • 64 images are provided in each mini-batch • the data is shuffled to improve convergence • the async data layer uses Julia's @async to pre-read data while computation runs on the CPU / GPU

  13. Convolution layer [LeNet-5 diagram: INPUT 32x32 -> C1: feature maps 6@28x28 -> S2: feature maps 6@14x14 -> C3: feature maps 16@10x10 -> S4: feature maps 16@5x5 -> C5: layer 120 -> F6: layer 84 -> OUTPUT 10; alternating convolutions and subsampling, followed by full and Gaussian connections.] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324. conv_layer = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5), bottoms=[:data], tops=[:conv])

  14. Pooling Layer pool_layer = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2), bottoms=[:conv], tops=[:pool]) • The pooling layer operates on the output of the convolution layer • By default, MAX pooling is performed; switch to MEAN pooling by specifying pooling=Pooling.Mean()

  15. Blobs & Net Architecture • Network architecture is determined by connecting tops (output) blobs to bottoms (input) blobs with matching blob names. • Layers are automatically sorted and connected as a directed acyclic graph (DAG).

  16. Rest of the layers conv2_layer = ConvolutionLayer(name="conv2", n_filter=50, kernel=(5,5), bottoms=[:pool], tops=[:conv2]) pool2_layer = PoolingLayer(name="pool2", kernel=(2,2), stride=(2,2), bottoms=[:conv2], tops=[:pool2]) fc1_layer = InnerProductLayer(name="ip1", output_dim=500, neuron=Neurons.ReLU(), bottoms=[:pool2], tops=[:ip1]) fc2_layer = InnerProductLayer(name="ip2", output_dim=10, bottoms=[:ip1], tops=[:ip2]) loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2, :label])
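With all layers defined, they are assembled into a net as described on slide 15: blobs connect by matching names. A minimal sketch of the assembly; the net name "MNIST-train" and the CPU backend here are illustrative (slide 20 shows switching to the GPU backend).

    backend = CPUBackend()        # or GPUBackend(), see slide 20
    init(backend)

    net = Net("MNIST-train", backend,
              [data_layer, conv_layer, pool_layer, conv2_layer, pool2_layer,
               fc1_layer, fc2_layer, loss_layer])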

  17. SGD Solver params = SolverParameters(max_iter=10000, regu_coef=0.0005, mom_policy=MomPolicy.Fixed(0.9), lr_policy=LRPolicy.Inv(0.01, 0.0001, 0.75), load_from=exp_dir) solver = SGD(params)

  18. Coffee Breaks… …for the solver setup_coffee_lounge(solver, save_into="$exp_dir/statistics.jld", every_n_iter=1000) # report training progress every 100 iterations add_coffee_break(solver, TrainingSummary(), every_n_iter=100) # save snapshots every 5000 iterations add_coffee_break(solver, Snapshot(exp_dir), every_n_iter=5000)
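Once the solver and its coffee breaks are configured, training is started by handing the net to the solver. A brief sketch, assuming the net assembled after slide 16:

    solve(solver, net)    # runs SGD for up to max_iter iterations, pausing for coffee breaks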

  19. Solver Statistics Solver statistics are saved automatically if the coffee lounge is set up. Snapshots save the training progress periodically, so training can resume from the last snapshot after an interruption.
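The statistics file is a JLD archive, so it can be inspected from Julia afterwards, e.g. for plotting learning curves. A hedged sketch; the talk does not show the key layout inside the file, so this only lists what was saved:

    using JLD
    stats = load("$exp_dir/statistics.jld")   # Dict of the saved variables
    println(keys(stats))                      # see what the coffee lounge recorded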

  20. Demo: GPU vs CPU backend = use_gpu ? GPUBackend() : CPUBackend()
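A sketch of the backend lifecycle around this one-liner; the MOCHA_USE_CUDA environment variable is an assumption about how the CUDA backend is enabled, and use_gpu is an illustrative switch:

    ENV["MOCHA_USE_CUDA"] = "true"   # assumed flag; set before `using Mocha` to enable GPUBackend
    using Mocha

    use_gpu = true
    backend = use_gpu ? GPUBackend() : CPUBackend()
    init(backend)
    # ... build nets, run the solver ...
    destroy(net)
    shutdown(backend)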

  21. Parameter Sharing • When a layer has trainable parameters (e.g. convolution, inner-product layers), those parameters will be registered under the layer name, and shared by layers with the same name • Use cases • Validation network during training • Pre-training, fine-tuning • Advanced architectures, time-delayed nodes
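For example, the validation-during-training use case builds a second net whose trainable layers carry the same names as in the training net (here by reusing the same layer objects), so their parameters are shared. A sketch following the MNIST example; the test data source, accuracy layer, and evaluation interval are illustrative assumptions:

    test_data = AsyncHDF5DataLayer(name="test-data", source="data/test.txt",
                                   batch_size=100)
    acc_layer = AccuracyLayer(name="test-accuracy", bottoms=[:ip2, :label])

    # conv1 / conv2 / ip1 / ip2 keep their names, so their parameters are shared
    test_net = Net("MNIST-test", backend,
                   [test_data, conv_layer, pool_layer, conv2_layer, pool2_layer,
                    fc1_layer, fc2_layer, acc_layer])

    # evaluate on the validation net periodically during training
    add_coffee_break(solver, ValidationPerformance(test_net), every_n_iter=1000)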

  22. Parameter Sharing

  23. 3rd most-starred Julia package Contributions are very welcome!
