@ NVIDIA GTC, April 6th, 2016
A Powerful, Flexible, and Intuitive Deep Learning Framework
Shohei Hido, Chief Research Officer, Preferred Networks, Inc.
Overview (http://chainer.org/)
- Chainer is a Python-based deep learning framework
- Chainer v1.0 was released as open source in June 2015
- It DOESN'T rely on Theano, unlike other Python frameworks
- Chainer uses a unique scheme named Define-by-Run
- Why do users still need another framework?
- How different and effective is Chainer?
Preferred Networks (PFN): a startup that applies deep learning to industrial IoT
- Founded: March 2014
- Headquarters: Tokyo, Japan
- U.S. subsidiary: San Mateo, California
- Company size: 35 engineers & researchers
- Investors: Toyota, FANUC, NTT
Focus areas: manufacturing, healthcare, automotive (industrial IoT + deep learning)
Partnering with world-leading companies using Chainer
- R&D collaboration on industrial problems with real-world data
  - Specific requirements, modified algorithms, much trial and error, etc.
  - Different from building a general-purpose recognition system
Partners: NTT, FANUC, Toyota, Panasonic, Cisco, NVIDIA
Two types of background behind DL frameworks

1. Scalability-oriented
- Use cases in mind: image/speech recognition systems; fast DL as a service in the cloud
- Problem type: a few general applications; 10+ million training samples; 10+ node clusters with fast networks
- Possible bottlenecks: tuning of well-known algorithms; distributed computation for model/data-parallel training

2. Flexibility-oriented
- Use cases in mind: algorithm research; R&D projects for new products
- Problem type: various specific applications; 10+ k training samples; 1 node with multiple GPUs
- Possible bottlenecks: trial and error in prototyping; debugging, profiling & refactoring; (wait time during compilation)
Designed for efficient research & development
- Flexible: new kinds of complex models for various applications
- Intuitive: rapid prototyping and efficient trial and error
- Powerful: comparable performance for 1 node & multiple GPUs
(Figure: spectrum from scalability-oriented to flexibility-oriented frameworks)
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Neural network and computation
(Figure: forward computation maps inputs x_1 ... x_N (image, sensor, text) through hidden units h_1 ... h_H and k_1 ... k_M to outputs y_1 ... y_M, e.g. object class "Tulip", anomaly score 0.35, text category "Sports"; backward computation (backpropagation) runs through the same network in reverse.)
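The following is a minimal NumPy sketch (an illustration added here, not from the slide) of one forward pass through a single hidden layer; the names W1, b1, W2, b2 are placeholder parameters:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # hidden units: nonlinearity applied to an affine transform of the input
        h = np.tanh(x.dot(W1) + b1)
        # output units, e.g. class scores or an anomaly score
        return h.dot(W2) + b2

    x = np.random.rand(1, 784).astype(np.float32)             # one input sample
    W1 = np.random.randn(784, 100).astype(np.float32) * 0.01
    b1 = np.zeros(100, dtype=np.float32)
    W2 = np.random.randn(100, 10).astype(np.float32) * 0.01
    b2 = np.zeros(10, dtype=np.float32)
    y = forward(x, W1, b1, W2, b2)                            # forward computation

Backward computation (backpropagation) propagates the loss gradient through the same graph in reverse to obtain gradients for W1, b1, W2, and b2.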
Chainer focuses on network representation/training
- Design choices for deep learning frameworks:
  - How to build neural networks?
  - How to train neural networks?
  - Which text format/language for modeling?
  - Which language for computing?
  - Run with a GPU?
  - Run on multiple GPUs?
  - Run on multiple compute nodes?
Building and training neural networks: computational graph construction is the key
1. Construct a computational graph
  - Based on the network definition given by the user
  - Chains of functions and operations on input variables
2. Compute loss and gradients
  - Forward computation calculates the loss for a minibatch
  - Backpropagation gives gradients for all parameters
3. Optimize the model
  - Update each parameter with its gradient
  - Repeat until convergence
Step 1 is the most important, and there are many approaches (see the sketch below).
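A rough sketch of the three steps in Chainer-style Python (hedged: model, n_epochs, and minibatches are assumed to be defined elsewhere, with model being a Chain whose __call__(x, t) returns a loss Variable):

    from chainer import optimizers

    optimizer = optimizers.SGD()
    optimizer.setup(model)          # attach the optimizer to the model's parameters

    for epoch in range(n_epochs):
        for x, t in minibatches:
            model.zerograds()       # clear previously accumulated gradients
            loss = model(x, t)      # steps 1 & 2: the forward pass builds the graph and computes the loss
            loss.backward()         # step 2: backpropagation fills the .grad of every parameter
            optimizer.update()      # step 3: update each parameter with its gradient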
Building blocks
- These functionalities are very similar between frameworks
- But the structure, abstraction level, and interface are different
- It comes down to the design of a domain-specific language for neural networks
Building blocks: array data structure (vector/matrix/tensor), network (computational graph), operations & functions, optimizer (SGD/AdaGrad/Adam)
Types of domain-specific language for neural networks
- Text DSL: definition in a text format
  - Ex. Caffe (prototxt), CNTK (NDL)
- Symbolic program: operations on symbols
  - Ex. Theano, TensorFlow
- Imperative program: direct computations on raw data arrays
  - Ex. Torch.nn, Chainer
- Ex. MXNet supports both the symbolic and the imperative style

Text DSL (definition in text):
    %% Definition in text (f.txt)
    f: {
      "A": "Variable",
      "B": "Variable",
      "C": ["B", "*", "A"],
      "ret": ["C", "+", 1]
    }
    # Compile
    f = compile("f.txt")
    d = f(A=np.ones(10), B=np.ones(10) * 2)

Symbolic definition:
    A = Variable('A')
    B = Variable('B')
    C = B * A
    D = C + Constant(1)
    # Compile
    f = compile(D)
    d = f(A=np.ones(10), B=np.ones(10) * 2)

Imperative declaration:
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1
Comparison of DSL types

Text DSL
  Pros: human-readable definition; non-programmers can easily edit the network
  Cons: users must study the format; the format might have to be extended for new algorithms

Internal DSL, symbolic
  Pros: static analysis at compile time; optimization before training; easy to parallelize
  Cons: users must study the special syntax; may need more effort to implement new algorithms

Internal DSL, imperative
  Pros: less effort to learn the syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  Cons: hard to optimize in advance; less efficient in memory allocation and parallelization

Chainer is at the extreme end of the imperative style, for high flexibility.
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Chainer as an open-source project
- https://github.com/pfnet/chainer
- 50 contributors
- 1,277 stars & 255 forks
- 3,708 commits
- Original developer: Seiya Tokui
- Active development & releases for the last 10 months
  - v1.0.0 (June 2015) to v1.7.2 (March 2016)
Chainer software stack
- Chainer is built on top of NumPy and CUDA
- CuPy is also introduced as an equivalent of NumPy on GPU
Stack: Chainer sits on top of NumPy and CuPy; NumPy runs on BLAS on the CPU; CuPy runs on cuDNN/CUDA on NVIDIA GPUs
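A minimal sketch of the NumPy/CuPy interchangeability (hedged: in Chainer v1, CuPy ships inside Chainer as chainer.cuda.cupy; the standalone cupy package came later):

    import numpy as np
    from chainer import cuda
    cp = cuda.cupy                  # CuPy module, NumPy-compatible API on GPU

    def affine_tanh(xp, x, W, b):
        # xp is either numpy or cupy; the basic array API is interchangeable
        return xp.tanh(x.dot(W) + b)

    x = np.random.rand(4, 3).astype(np.float32)
    W = np.random.rand(3, 2).astype(np.float32)
    b = np.zeros(2, dtype=np.float32)

    y_cpu = affine_tanh(np, x, W, b)                                      # runs on the CPU (BLAS)
    y_gpu = affine_tanh(cp, cp.asarray(x), cp.asarray(W), cp.asarray(b))  # runs on the GPU (CUDA)
    print(np.allclose(y_cpu, cp.asnumpy(y_gpu), atol=1e-5))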
Graph build scheme (1/2) - Define-and-Run: most frameworks use this scheme (Chainer does not)
- Define: build a computational graph based on the network definition
- Run: update the model (parameters) using the training dataset
(Figure: in the Define phase, the network definition is turned into a computational graph and, via automatic differentiation, a gradient function; in the Run phase, training data flows through the graph and gradient function to produce loss & gradients, which update the parameters.)
Graph build scheme (2/2) - Define-by-Run: computational graph construction on the fly
- No graph is constructed before training
- Instead, the graph is built at each forward computation
- The computational graph can be modified dynamically for each iteration/sample or depending on some conditions
(Figure: the model definition, training data, and conditions feed the computational graph and gradient function at run time; the resulting gradients update the parameters.)
Define-by-Run example: MLP for MNIST
- Only the transformations between units are set up before training:
    l1 = Linear(784, n_units)
    l2 = Linear(n_units, 10)
- The connectivity is given as forward computation:
    def forward(x):
        h1 = ReLU(l1(x))
        return l2(h1)
(Figure: x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y, classifying digits 0-9)
Define-by-Run: an interpreted language for neural networks
- Idea
  - Forward computation actually goes through the computational graph
  - By remembering the history, the actual graph can be obtained
- Advantages
  - Flexibility for new algorithms with complex components
    - e.g. recurrent, recursive, attention, memory, adversarial, etc.
  - Intuitive coding with highly imperative network definitions
    - e.g. a stochastic network whose graph changes for each iteration (see the sketch below)
- Current drawbacks
  - The graph is regenerated every time, even for fixed networks
  - No optimization, even for static parts of the graph
    - JIT-like analysis and subgraph caching might be useful
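A minimal sketch of such a stochastic network (hedged: l1, l2, and l_extra are assumed Linear links defined elsewhere; F is chainer.functions):

    import random
    import chainer.functions as F

    def forward(x, train=True):
        h = F.relu(l1(x))
        if train and random.random() < 0.5:   # ordinary Python control flow changes the graph
            h = F.relu(l_extra(h))            # this layer is part of the graph only on some iterations
        return l2(h)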
Basic components (1/2): Variable and Function
- Variable
  - Variable wraps arrays (.data)
  - It remembers its parent function (.creator)
  - It will be assigned a gradient (.grad)
  - It keeps track of not only data but also computations
- Function
  - A transformation between Variables
  - Stateless
  - e.g. sigmoid, tanh, ReLU, maxpooling, dropout
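A minimal sketch of this bookkeeping in the Chainer v1 API (hedged illustration):

    import numpy as np
    from chainer import Variable
    import chainer.functions as F

    x = Variable(np.array([[0.5, -1.0]], dtype=np.float32))
    y = F.sigmoid(x)                 # applying a Function records the history

    print(y.data)                    # the wrapped array (forward result)
    print(y.creator)                 # the function that produced y
    y.grad = np.ones_like(y.data)    # seed the output gradient
    y.backward()                     # backpropagate through the recorded graph
    print(x.grad)                    # gradient assigned to the input Variable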
Basic components (2/2): Link and Chain
- Link = function with state
  - Parameters are also Variables, and gradients will be assigned to them
  - e.g. Linear (fully connected, y = f(W*x + b)), LSTM, Convolution2D, word embedding
- Chain = network
  - A Chain has a set of child Links
  - Forward computation is defined in .__call__()
  - e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq
(Figure: MLP2 as a Chain of Linear l1 -> ReLU -> Linear l2, each Linear holding W and bias)
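A minimal sketch of a two-layer Chain in the Chainer v1 style (hedged; MLP2 here is an illustrative definition, not code from the slide):

    import chainer
    import chainer.functions as F
    import chainer.links as L

    class MLP2(chainer.Chain):
        def __init__(self, n_units):
            super(MLP2, self).__init__(
                l1=L.Linear(784, n_units),   # child Links hold the parameters (W, b)
                l2=L.Linear(n_units, 10),
            )

        def __call__(self, x):
            h1 = F.relu(self.l1(x))          # the forward computation defines the graph
            return self.l2(h1)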
Backpropagation through the computational graph
- Consider an objective (using Link.Linear): L = f(x * W + b)
- This computes the value of L in the forward computation, and simultaneously builds the following computational graph:
    x, W, b (Variables) -> * -> + -> f (Functions) -> L (Variable)
- The gradient of L can be computed with respect to any variable by backpropagation
- Then the optimizer updates the values of the parameters
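A minimal end-to-end sketch in the Chainer v1 API (hedged: the output is reduced to a scalar with F.sum so that backward() can be called directly):

    import numpy as np
    from chainer import Variable, optimizers
    import chainer.functions as F
    import chainer.links as L

    lin = L.Linear(3, 1)                    # holds W and b as parameter Variables
    optimizer = optimizers.SGD(lr=0.01)
    optimizer.setup(lin)

    x = Variable(np.random.rand(2, 3).astype(np.float32))
    loss = F.sum(F.sigmoid(lin(x)))         # L = f(x * W + b), summed to a scalar
    lin.zerograds()                         # clear old gradients
    loss.backward()                         # fills lin.W.grad and lin.b.grad
    optimizer.update()                      # the optimizer updates W and b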