RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS ON MULTI-CORE PROCESSORS
Noel Lopes, Bernardete Ribeiro, João Gonçalves
University of Coimbra / Polytechnic Institute of Guarda
June 11, 2012, WCCI–IJCNN
DEEP BELIEF NETWORKS (DBNs)
“Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. [...] The lower layers receive top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.” Geoffrey E. Hinton [Hinton et al., 2006]
OUTLINE
Motivation
Deep Belief Networks
Restricted Boltzmann Machines
GPU implementation
Results on the MNIST Handwritten Digits Database
Conclusions and Future Work
MOTIVATION
The robustness and efficiency with which humans recognize objects has long been an intriguing challenge for computational intelligence.
Theoretical results suggest that deep architectures are fundamental for learning the complex functions that can represent high-level abstractions (e.g. vision, language) [Bengio, 2009].
Empirical results show their successful application to classification, regression, dimensionality reduction, object recognition, information retrieval, robotics, collaborative filtering, etc. [Larochelle et al., 2007, Swersky et al., 2010].
DEEP VERSUS SHALLOW ARCHITECTURES
[Figure: a deep architecture computes the model outputs y from the model inputs x through d levels of non-linear operations, extracting low-order features at level 1 up to high-order features at level d; a shallow architecture maps the inputs x to the outputs y through a single level of non-linear operations.]
DEEP BELIEF NETWORKS
DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other.
[Figure: an input layer x with hidden layers h1, h2, h3 stacked above it.]
RESTRICTED BOLTZMANN MACHINES
An RBM is an energy-based generative model that consists of a layer of binary visible units, v, and a layer of binary hidden units, h.
[Figure: visible units v1, v2, ..., vI plus a bias unit, fully connected to hidden units h1, h2, ..., hJ plus a bias unit; the visible-to-hidden direction acts as an encoder and the hidden-to-visible direction as a decoder.]
RESTRICTED BOLTZMANN MACHINES
Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by (1):
E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{I} a_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j    (1)
[Figure: bipartite graph connecting hidden units h1, ..., hJ to visible units v1, ..., vI.]
RESTRICTED BOLTZMANN MACHINES
The RBM defines a joint probability over (v, h):
p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z},    (2)
where Z is the partition function, obtained by summing e^{-E(\mathbf{v}, \mathbf{h})} over all possible (v, h) configurations:
Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}.    (3)
[Figure: RBM bipartite graph of hidden units h1, ..., hJ and visible units v1, ..., vI.]
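To make equations (2) and (3) concrete, here is a tiny illustrative C program (not part of the original talk; the parameter values and the constants NV and NH are made up) that computes Z by brute force for an RBM with 3 visible and 2 hidden units. The sum has 2^(I+J) terms, which is exactly why Z, and hence p(v, h), becomes intractable for realistic layer sizes and why training relies on Contrastive Divergence instead.

    #include <math.h>
    #include <stdio.h>

    #define NV 3   /* number of visible units (I) */
    #define NH 2   /* number of hidden units  (J) */

    /* Energy of a joint configuration (v, h), equation (1). W is row-major: W[j*NV+i] = W_ji. */
    static float energy(const int *v, const int *h,
                        const float *a, const float *b, const float *W) {
        float E = 0.0f;
        for (int i = 0; i < NV; i++) E -= a[i] * v[i];
        for (int j = 0; j < NH; j++) E -= b[j] * h[j];
        for (int j = 0; j < NH; j++)
            for (int i = 0; i < NV; i++)
                E -= W[j * NV + i] * v[i] * h[j];
        return E;
    }

    int main(void) {
        float a[NV] = {0.1f, -0.2f, 0.0f};                     /* illustrative biases  */
        float b[NH] = {0.3f, -0.1f};
        float W[NH * NV] = {0.5f, -0.4f, 0.2f, 0.1f, 0.3f, -0.6f};
        float Z = 0.0f;
        int v[NV], h[NH];
        /* Enumerate all 2^(NV+NH) joint configurations, equation (3). */
        for (int vs = 0; vs < (1 << NV); vs++) {
            for (int hs = 0; hs < (1 << NH); hs++) {
                for (int i = 0; i < NV; i++) v[i] = (vs >> i) & 1;
                for (int j = 0; j < NH; j++) h[j] = (hs >> j) & 1;
                Z += expf(-energy(v, h, a, b, W));
            }
        }
        printf("Z = %f\n", Z);   /* p(v, h) = exp(-E(v, h)) / Z, equation (2) */
        return 0;
    }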
RESTRICTED BOLTZMANN MACHINES
Given a random input configuration v, the state of hidden unit j is set to 1 with probability:
p(h_j = 1 \mid \mathbf{v}) = \sigma(b_j + \sum_{i=1}^{I} v_i W_{ji}),    (4)
where \sigma(x) = 1 / (1 + e^{-x}) is the logistic function. Similarly, given a random hidden vector h, the state of visible unit i is set to 1 with probability:
p(v_i = 1 \mid \mathbf{h}) = \sigma(a_i + \sum_{j=1}^{J} h_j W_{ji}).    (5)
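As a single-threaded reference for equations (4) and (5), a minimal C sketch follows (the function names and the row-major weight layout are illustrative choices, not the authors' code); the GPU kernels described later in the talk parallelize exactly these weighted sums.

    #include <math.h>

    /* Logistic (sigmoid) function. */
    static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

    /* p(h_j = 1 | v), equation (4). W holds J rows of I weights, so W[j*I+i] = W_ji. */
    float hidden_unit_probability(const float *v, const float *W, const float *b,
                                  int j, int I) {
        float sum = b[j];
        for (int i = 0; i < I; i++)
            sum += v[i] * W[j * I + i];
        return sigmoidf(sum);
    }

    /* p(v_i = 1 | h), equation (5): the symmetric computation over the hidden layer. */
    float visible_unit_probability(const float *h, const float *W, const float *a,
                                   int i, int I, int J) {
        float sum = a[i];
        for (int j = 0; j < J; j++)
            sum += h[j] * W[j * I + i];
        return sigmoidf(sum);
    }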
TRAINING AN RBM
The following learning rule performs stochastic steepest ascent in the log probability of the training data:
\frac{\partial \log p(\mathbf{v})}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty    (6)
where \langle \cdot \rangle_0 denotes expectations under the data distribution (p_0) and \langle \cdot \rangle_\infty denotes expectations under the model (equilibrium) distribution.
[Figure: RBM bipartite graph of hidden units h1, ..., hJ and visible units v1, ..., vI.]
GIBBS SAMPLING
[Figure: the data vector is clamped on the visible units, v(0) = x, and the hidden states h(0) are sampled from them, yielding the data statistics \langle v_i h_j \rangle_0.]
p(h_j = 1 \mid \mathbf{v}) = \sigma(b_j + \sum_{i=1}^{I} v_i W_{ji})
ALTERNATING GIBBS SAMPLING
[Figure: from the hidden states h(0), the visible units are resampled to obtain the reconstruction v(1).]
p(v_i = 1 \mid \mathbf{h}) = \sigma(a_i + \sum_{j=1}^{J} h_j W_{ji})
ALTERNATING GIBBS SAMPLING
[Figure: alternating between sampling the hidden and visible layers produces the chain v(0) = x, h(0), v(1), h(1), v(2), h(2), ..., v(∞), h(∞); \langle v_i h_j \rangle_0 is measured at the start of the chain and \langle v_i h_j \rangle_\infty at equilibrium.]
CONTRASTIVE DIVERGENCE (CD–k)
Hinton proposed the Contrastive Divergence (CD) algorithm.
CD–k replaces \langle \cdot \rangle_\infty by \langle \cdot \rangle_k for small values of k.
CONTRASTIVE DIVERGENCE (CD–k)
v(0) ← x
Compute the binary (feature) states of the hidden units, h(0), using v(0)
for n ← 1 to k
    Compute the “reconstruction” states of the visible units, v(n), using h(n−1)
    Compute the “reconstruction” states of the hidden units, h(n), using v(n)
end for
Update the weights and biases, where γ is the learning rate, according to:
\Delta W_{ji} = \gamma (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)    (7)
\Delta b_j = \gamma (\langle h_j \rangle_0 - \langle h_j \rangle_k)    (8)
\Delta a_i = \gamma (\langle v_i \rangle_0 - \langle v_i \rangle_k)    (9)
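The CD–k procedure above can be condensed into a single-threaded reference sketch in C (illustrative only, not the paper's GPU implementation; all function names are hypothetical): one update for a single training vector x, with the learning rate γ passed as gamma and binary states sampled with uniform random numbers.

    #include <math.h>
    #include <stdlib.h>

    static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }
    static float uniform01(void)   { return (float)rand() / (float)RAND_MAX; }

    /* Sample binary hidden states h from p(h_j = 1 | v), equation (4). */
    static void sample_hidden(const float *v, const float *W, const float *b,
                              float *h, int I, int J) {
        for (int j = 0; j < J; j++) {
            float s = b[j];
            for (int i = 0; i < I; i++) s += v[i] * W[j * I + i];
            h[j] = (sigmoidf(s) > uniform01()) ? 1.0f : 0.0f;
        }
    }

    /* Sample binary visible states v from p(v_i = 1 | h), equation (5). */
    static void sample_visible(const float *h, const float *W, const float *a,
                               float *v, int I, int J) {
        for (int i = 0; i < I; i++) {
            float s = a[i];
            for (int j = 0; j < J; j++) s += h[j] * W[j * I + i];
            v[i] = (sigmoidf(s) > uniform01()) ? 1.0f : 0.0f;
        }
    }

    /* One CD-k update for a single training vector x, equations (7)-(9).
       v, h0 and hk are caller-provided work buffers of sizes I, J and J. */
    void cd_k_update(const float *x, float *W, float *a, float *b,
                     float *v, float *h0, float *hk,
                     int I, int J, int k, float gamma) {
        for (int i = 0; i < I; i++) v[i] = x[i];   /* v(0) <- x            */
        sample_hidden(v, W, b, h0, I, J);          /* h(0) from v(0)       */
        for (int j = 0; j < J; j++) hk[j] = h0[j];
        for (int n = 1; n <= k; n++) {             /* k Gibbs steps        */
            sample_visible(hk, W, a, v, I, J);     /* reconstruction v(n)  */
            sample_hidden(v, W, b, hk, I, J);      /* reconstruction h(n)  */
        }
        for (int j = 0; j < J; j++) {
            for (int i = 0; i < I; i++)            /* Delta W_ji, eq. (7)  */
                W[j * I + i] += gamma * (x[i] * h0[j] - v[i] * hk[j]);
            b[j] += gamma * (h0[j] - hk[j]);       /* Delta b_j,  eq. (8)  */
        }
        for (int i = 0; i < I; i++)
            a[i] += gamma * (x[i] - v[i]);         /* Delta a_i,  eq. (9)  */
    }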
DEEP BELIEF NETWORKS (DBN)
[Figure: the DBN is built layer by layer: a first RBM models p(h1 | x) and p(x | h1); a second RBM stacked on top models p(h2 | h1) and p(h1 | h2); a third models p(h3 | h2) and p(h2 | h3).]
DEEP BELIEF NETWORKS (DBN)
[Figure: the input layer x holds low-level features; the successive hidden layers h1, h2, h3 capture increasingly high-level features (concepts).]
GPU IMPLEMENTATION
Training a DBN is a computationally expensive task that involves training several RBMs and may require a considerable amount of time.
Solution? A parallel GPU implementation.
CUDA – DEVICE ARCHITECTURE
[Figure: a CUDA device contains N streaming multiprocessors (SM 1, SM 2, ..., SM N); each SM holds M processors, an instruction unit, and shared memory, and all SMs access the device memory.]
CUDA – LAUNCHING A KERNEL
[Figure: a kernel launch creates a grid of thread blocks, e.g. Block(0,0) through Block(3,1), each containing a two-dimensional arrangement of threads, e.g. Thread(0,0) through Thread(3,2).]
Threads within a block can share information.
However, blocks are required to run independently.
To address scalability, the tasks should therefore be partitioned.
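To illustrate the launch model, here is a minimal CUDA program (illustrative only, unrelated to the paper's kernels; the kernel name and sizes are made up): a grid of independent blocks of 256 threads is launched, and each thread computes one element of an element-wise sigmoid.

    #include <stdio.h>
    #include <math.h>

    __global__ void SigmoidKernel(const float *x, float *y, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (idx < n)
            y[idx] = 1.0f / (1.0f + expf(-x[idx]));
    }

    int main(void) {
        const int n = 1024;
        const size_t bytes = n * sizeof(float);
        float hx[n], hy[n];
        for (int i = 0; i < n; i++) hx[i] = 0.01f * (i - n / 2);

        float *dx, *dy;
        cudaMalloc((void **)&dx, bytes);
        cudaMalloc((void **)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);

        int threadsPerBlock = 256;                                 /* threads cooperate within a block */
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* blocks run independently         */
        SigmoidKernel<<<blocks, threadsPerBlock>>>(dx, dy, n);

        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("sigmoid(%f) = %f\n", hx[0], hy[0]);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }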
CUDA – SCALABILITY
[Figure: the same grid of eight blocks is scheduled onto the available streaming multiprocessors: a device with 2 SMs executes four blocks per SM, while a device with 4 SMs executes two blocks per SM, so the same program scales automatically with the number of SMs.]
KERNELS
[Figure: data flow of the RBM training kernels. The RBM inputs x are stored in v_data ∈ ℝ^{N×I}; the weights w ∈ ℝ^{J×I}, the visible-unit biases a ∈ ℝ^I and the hidden-unit biases b ∈ ℝ^J are used by every step. Step 1: ComputeStatusHiddenUnits computes h_data ∈ ℝ^{N×J} (the RBM outputs for the data). Step 2: ComputeStatusVisibleUnits computes the reconstructed inputs v_recon ∈ ℝ^{N×I}. Step 3: ComputeStatusHiddenUnits computes the reconstructed outputs h_recon ∈ ℝ^{N×J}. Step 4: CorrectWeights updates the weights and biases.]
ComputeStatusHiddenUnits AND ComputeStatusVisibleUnits KERNELS
Each thread represents a connection: it multiplies the clamped input by the corresponding weight and stores the result in shared memory.
Each block represents a neuron: it uses the fast shared memory to sum up the values computed by its threads.
[Figure: a block (neuron) containing threads for Connection 1, Connection 2, Connection 3, ..., Connection J.]
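One possible realization of this strategy is sketched below. The kernel name matches the slide, but the signature, the row-major memory layouts, and the pre-generated random-number buffer are assumptions, not the authors' exact code: one block per (training sample, hidden neuron) pair, one thread per connection, and a shared-memory tree reduction.

    /* Hypothetical launch configuration:
         threads = smallest power of two >= I (number of visible units)
         ComputeStatusHiddenUnits<<<dim3(J, N), threads, threads * sizeof(float)>>>(...)
       so blockIdx.x selects the hidden neuron, blockIdx.y the training sample,
       and threadIdx.x the connection. */
    __global__ void ComputeStatusHiddenUnits(const float *v,    /* [N x I] visible states       */
                                             const float *W,    /* [J x I] weights, row-major   */
                                             const float *b,    /* [J]     hidden biases        */
                                             const float *rnd,  /* [N x J] uniform(0,1) numbers */
                                             float *h,          /* [N x J] hidden states (out)  */
                                             int I) {
        extern __shared__ float partial[];        /* one slot per thread (connection) */
        int sample = blockIdx.y;                  /* training vector                  */
        int j      = blockIdx.x;                  /* hidden neuron                    */
        int i      = threadIdx.x;                 /* connection                       */

        /* Each thread multiplies one clamped input by its weight. */
        partial[i] = (i < I) ? v[sample * I + i] * W[j * I + i] : 0.0f;
        __syncthreads();

        /* Tree reduction in shared memory (blockDim.x must be a power of two). */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (i < stride) partial[i] += partial[i + stride];
            __syncthreads();
        }

        /* Thread 0 adds the bias, applies the logistic function (equation (4))
           and samples the binary state of the hidden unit. */
        if (i == 0) {
            float p = 1.0f / (1.0f + expf(-(partial[0] + b[j])));
            h[sample * gridDim.x + j] = (p > rnd[sample * gridDim.x + j]) ? 1.0f : 0.0f;
        }
    }

In this sketch, ComputeStatusVisibleUnits would follow the same pattern with the roles of the layers exchanged (one block per visible neuron, one thread per hidden unit), and CorrectWeights would accumulate the statistics of equations (7)–(9).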