CS480/680 Machine Learning
Lecture 20: Convolutional Neural Networks
Zahra Sheikhbahaee, University of Waterloo
March 29, 2020
Outline
◮ Convolution
◮ Zero padding
◮ Stride
◮ Weight sharing
◮ Pooling
◮ Convolutional neural net architectures: LeNet-5, AlexNet, ResNet, Inception
Computer Vision Tasks Using Convolutional Networks
◮ Object detection (Figure: Faster R-CNN model)
◮ Neural style transfer (Figure: The Shipwreck of the Minotaur by J.M.W. Turner, 1805; The Starry Night by Vincent van Gogh, 1889; Der Schrei by Edvard Munch, 1893)
◮ Semantic segmentation (Figure: FCN)
Convolutional Neural Networks
◮ A convolutional neural network (CNN) is designed to automatically and adaptively learn spatial hierarchies of features through the backpropagation algorithm.
◮ A deficiency of fully connected architectures is that the topology of the input is entirely ignored.
◮ Convolutional neural networks combine three mechanisms:
- local receptive fields
- shared weights
- spatial or temporal subsampling
◮ A CNN is composed of multiple building blocks, such as convolution layers, pooling layers, and fully connected layers.
Definition of Convolution
Mathematical definition:
$$g(x, y) = (K \ast I)(x, y) = \sum_{i=-a}^{a} \sum_{j=-b}^{b} K(i, j)\, I(x - i, y - j)$$
Figure: Edge detection with horizontal and vertical filters - the right-hand image is convolved with a 3 × 3 Sobel filter, which puts a little more weight on the central pixels. The coefficient matrices for the Sobel filter are
$$G_x = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix} \quad \text{and} \quad G_y = G_x^T.$$
An edge is where the pixel intensity changes markedly, and a good way to express changes is by using derivatives. The Sobel operator approximates the magnitude of the image derivative by $G = \sqrt{G_x^2 + G_y^2}$.
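To make the definition concrete, here is a small NumPy sketch that implements the double sum above directly and applies the Sobel operator to a toy image with a vertical edge (the helper function and the toy image are illustrative, not from the lecture):

```python
import numpy as np

def convolve2d(image, kernel):
    """Direct implementation of g(x, y) = sum_i sum_j K(i, j) I(x - i, y - j)
    over the 'valid' region only (no padding). A teaching sketch, not an
    optimized routine."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]           # true convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(flipped * image[y:y + kh, x:x + kw])
    return out

# Sobel edge detection on a toy 6x6 image containing a vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0
Gx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
Gy = Gx.T
edges = np.sqrt(convolve2d(image, Gx) ** 2 + convolve2d(image, Gy) ** 2)
print(edges)   # large values along the column where the intensity jumps
```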
Detecting Vertical Edges
- n_in: number of input features
- n_out: number of output features
- k: convolutional kernel size
Some Other Kernel Examples
Gabor Filters
◮ Gabor filters: common feature detectors inspired by the human visual system, widely used for texture analysis.
◮ A Gabor filter can be viewed as a sinusoidal plane wave of a particular frequency and orientation, modulated by a Gaussian envelope.
◮ Weights: grey → zero, white → positive, black → negative.
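A minimal sketch of how such a kernel can be generated, assuming the standard Gabor formulation (a cosine carrier under a Gaussian envelope); the parameter names here are illustrative, not from the lecture:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Sinusoidal plane wave of the given wavelength and orientation theta,
    modulated by a Gaussian envelope with scale sigma and aspect ratio gamma."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates by theta
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + (gamma * y_rot)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + psi)
    return envelope * carrier

kernel = gabor_kernel(size=31, wavelength=10.0, theta=np.pi / 4, sigma=6.0)
print(kernel.shape)   # (31, 31): a filter tuned to 45-degree oriented texture
```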
Padding Schemes
Definition of padding: the number of zeros concatenated at the beginning and at the end of an axis ($p$).
Why is padding important?
- Without padding, the output shrinks at every layer.
- Without padding, information at the corners of the image is thrown away.
To preserve the size of the output as that of the input, padding of $p = \frac{k - 1}{2}$ is needed for the input image.
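A small illustration of "same" zero padding with NumPy/SciPy, assuming $k = 3$ so that $p = (k - 1)/2 = 1$ preserves the input size (the box-blur kernel is just an example):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                         # simple box blur, k = 3
padded = np.pad(image, pad_width=1, mode='constant')   # p = (3 - 1) / 2 = 1 zeros per side
out = convolve2d(padded, kernel, mode='valid')
print(image.shape, out.shape)   # (5, 5) (5, 5): the output size is preserved
```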
Strided Convolution
Definition of stride: the distance between two consecutive positionings of the kernel along the axes ($s$).
Figure: (d) 2 × 2 strides, (e) unit strides.
Padding with $p$ zeros changes the effective input size from $n_{in}$ to $n_{in} + 2p$. The size of the output is then
$$n_{out} = \left\lfloor \frac{n_{in} - k + 2p}{s} \right\rfloor + 1.$$
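The formula can be wrapped in a small helper; the example layer sizes below are taken from the LeNet-5 and AlexNet slides later in the lecture:

```python
def conv_output_size(n_in, k, p=0, s=1):
    """Per-axis output size of a convolution: floor((n_in - k + 2p) / s) + 1."""
    return (n_in - k + 2 * p) // s + 1

print(conv_output_size(32, k=5))             # 28: LeNet-5 first convolutional layer
print(conv_output_size(227, k=11, s=4))      # 55: AlexNet first convolutional layer
print(conv_output_size(7, k=3, p=1, s=2))    # 4: strided convolution with padding
```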
Weight Sharing
– In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map.
– This provides the basis for the invariance of the network outputs to translations and distortions of the input images.
– Weight sharing helps reduce over-fitting because of the reduced number of trainable parameters, as the parameter-count example below illustrates.
– Modeling local correlations is easy with CNNs through the weight-sharing scheme.
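A rough back-of-the-envelope comparison (illustrative numbers, assuming a 32 × 32 grayscale input) of how weight sharing shrinks the parameter count relative to a fully connected layer:

```python
# Fully connected layer: one weight per input pixel per hidden unit, plus biases.
h, w = 32, 32
hidden_units = 100
fc_params = (h * w) * hidden_units + hidden_units
print(fc_params)                                      # 102500

# Convolutional layer: a shared 5x5 kernel and one bias per filter,
# independent of the image size (this matches LeNet-5's C1 layer: 156 parameters).
kernel, n_filters = 5, 6
conv_params = (kernel * kernel * 1 + 1) * n_filters
print(conv_params)                                    # 156
```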
Pooling Layers
Figure: Architecture of a CNN: FC is a fully connected layer, ReLU denotes a rectified linear unit, and c_i is the number of input channels.
- A convolution layer computes feature response maps that involve multiple channels within some localized spatial region.
- A pooling layer acts within just one channel at a time, condensing the activation values in each spatially local region of that channel.
- Pooling operations produce downstream representations that are more robust to variations in the data while still preserving important motifs.
- A pooling layer has no trainable parameters.
Types of Pooling Layers
◮ Max pooling
◮ Average pooling
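A minimal single-channel sketch of both operations over non-overlapping 2 × 2 windows with stride 2; note that nothing here is learned:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over a single 2D channel."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 2.],
              [7., 8., 1., 0.],
              [2., 9., 3., 4.]])
print(pool2d(x, mode="max"))   # [[6. 5.] [9. 4.]]
print(pool2d(x, mode="avg"))   # [[3.5 2.5] [6.5 2. ]]
```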
Drawbacks of Max and Average Pooling
◮ Max pooling drawback: only the strongest activation in each window is kept, so all other information in the pooling region is discarded.
◮ Average pooling drawback: strong activations are diluted by the weak or zero activations in the same window.
Convolutional neural net architecture (LeNet-5)
Figure: LeCun et al. 1998
◮ The input is a 32 × 32 grayscale image which passes through the first convolutional layer.
◮ Layer C1 has 6 feature maps (filters) with a 5 × 5 filter size, a stride of one and no padding. The image dimensions change from 32 × 32 × 1 to 28 × 28 × 6. The layer has 156 trainable parameters.
◮ Layer S2 is a subsampling layer with 6 feature maps. It has a filter size of 2 × 2 and a stride of s = 2. The resulting image dimensions are reduced to 14 × 14 × 6. Layer S2 has only 12 trainable parameters.
◮ Layer C3 is a convolutional layer with 16 feature maps. The filter size is 5 × 5 with a stride of 1, and it has 1516 trainable parameters.
◮ The S4 layer is an average pooling layer with filter size 2 × 2 and a stride of 2. This layer has 16 feature maps with 32 parameters, and its output is reduced to 5 × 5 × 16.
◮ The fifth layer C5 is a fully connected convolutional layer with 120 feature maps, each of size 1 × 1. Each of the 120 units in C5 is connected to all the 5 × 5 × 16 nodes in the S4 layer.
◮ The sixth layer is a fully connected layer F6 with 84 units, and the output layer uses fully connected Euclidean radial basis function (RBF) units instead of a softmax function.
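A minimal PyTorch sketch of a LeNet-5-style network following the sizes above. It is not an exact reproduction: the original S2/S4 subsampling layers have trainable coefficients, C3 is only partially connected to S2 (hence 1516 rather than 2416 parameters), and the RBF output layer is replaced here by a plain linear layer:

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),      # C1: 32x32x1 -> 28x28x6
    nn.AvgPool2d(kernel_size=2, stride=2),          # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),     # C3: -> 10x10x16
    nn.AvgPool2d(kernel_size=2, stride=2),          # S4: -> 5x5x16
    nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),   # C5: -> 1x1x120
    nn.Flatten(),
    nn.Linear(120, 84), nn.Tanh(),                  # F6: 84 units
    nn.Linear(84, 10),                              # output: 10 digit classes
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 10])
```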
Convolutional neural net architecture (AlexNet)
Figure: Krizhevsky et al. 2012
◮ It contains 5 convolutional layers and 3 fully connected layers.
◮ The first convolutional layer filters the 227 × 227 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of s = 4 pixels.
◮ The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48.
- Uses ReLU instead of tanh to add non-linearity; it accelerates training by about 6 times at the same accuracy.
- Uses dropout, rather than conventional regularisation alone, to deal with overfitting.
- Uses overlapping pooling to reduce the size of the network.
- Uses multiple GPUs to train the 62.3 million parameters.
- Employs Local Response Normalization.
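A hedged PyTorch sketch of the first two convolutional stages described above, with ReLU, Local Response Normalization, and overlapping 3 × 3/stride-2 max pooling. This is a single-device variant, so the second convolution sees all 96 channels rather than the 48-per-GPU split on the slide:

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),                   # 227x227x3 -> 55x55x96
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                        # overlapping pooling -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, padding=2),                 # -> 27x27x256
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 227, 227)
print(features(x).shape)   # torch.Size([1, 256, 27, 27])
```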
Convolutional neural net architecture (AlexNet)
◮ The parallelization scheme employed in AlexNet puts half of the kernels on each GPU.
◮ The GPUs communicate only in certain layers.
◮ The top 48 kernels on GPU 1 are largely color-agnostic.
◮ The bottom 48 kernels on GPU 2 are largely color-specific.
◮ This scheme reduces the top-1 and top-5 error rates by 1.7% and 1.2%, respectively.
Deep Residual Networks
◮ Training deep neural networks with gradient-based optimizers and learning methods can suffer from vanishing and exploding gradients during backpropagation.
◮ The degradation problem: as the network depth increases, accuracy gets saturated and then degrades rapidly. Adding more layers to a suitably deep model leads to higher training error.
◮ A residual network is a solution to the above problems. These networks are easier to optimize and can gain accuracy from considerably increased depth.
◮ The shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers.
$x_l$ and $x_{l+1}$: input and output of the $l$-th unit; $F$ and $h(x_l)$: a residual function and an identity mapping.
$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l), \quad \text{where } f \text{ is the ReLU.}$$
If $f$ is also an identity mapping, then $x_{l+1} = y_l$, so we have
$$x_{l+1} = x_l + F(x_l, W_l), \qquad x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i),$$
$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\,\frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right).$$
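A minimal PyTorch sketch of a basic residual block with an identity shortcut, following the $x_{l+1} = x_l + F(x_l, W_l)$ formulation above; the batch-normalization placement follows He et al.'s original (post-activation) block, which is one common variant:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x_l, W_l)
        return self.relu(x + residual)   # f(h(x_l) + F(x_l, W_l)) with identity shortcut

x = torch.randn(2, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([2, 64, 56, 56])
```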
Convolutional neural net architecture (ResNet)
Figure: He et al. 2015
◮ 152-layer model for the ImageNet competition with 3.57% top-5 error (better than human performance).
◮ Every residual block has two 3 × 3 convolutional layers.
◮ Periodically double the number of filters and downsample spatially using stride 2, which halves each spatial dimension.
◮ There is an additional conv layer at the beginning.
◮ No fully connected (FC) layers at the end, just a global average pooling layer (only an FC layer to the 1000 output classes).
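A small sketch of the network head described in the last bullet: global average pooling followed by a single fully connected layer to the 1000 ImageNet classes. The 512-channel, 7 × 7 input is an illustrative assumption for the output of the final residual stage:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pooling -> (N, 512, 1, 1)
    nn.Flatten(),              # -> (N, 512)
    nn.Linear(512, 1000),      # the only fully connected layer, to 1000 classes
)

feature_maps = torch.randn(4, 512, 7, 7)   # assumed output of the final residual stage
print(head(feature_maps).shape)            # torch.Size([4, 1000])
```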