

Math for VGG

1 Intro

I am writing this to help you understand what the code is doing. It is still a work in progress toward making it more reader-friendly. At this point, it is just a bunch of math formulas you do not want to follow.

2 Notational Conventions

• Since there are lots of variables that take multiple indices, they would be difficult to parse if we used subscripts for indices. We therefore put indices in parens, like x(i, j, k, l), instead of subscripts x_{i,j,k,l}. This is much easier to read.

3 Symbols

Constant Parameters and Indexes

• B (Batch size): the number of samples in a mini-batch
  – 0 ≤ b < B (batch index): an index of a sample in a mini-batch
• C (Classes): the number of classes
  – 0 ≤ c < C (class index): an index of a class
• IC (Input Channels): the number of channels in an input image of a layer (e.g., three if an image has red, green, and blue components)
  – 0 ≤ ic < IC (input channel index): an index of a channel in an input image
• OC (Output Channels): the number of channels in an output image of a layer
  – 0 ≤ oc < OC (output channel index): an index of a channel in an output image
• H (Height): the number of pixels in a single column of an image
  – 0 ≤ i < H (image row index)
• W (Width): the number of pixels in a single row of an image
  – 0 ≤ j < W (image column index)
• K (Kernel size): half of the kernel width, so a kernel has (2K + 1) × (2K + 1) pixels; throughout VGG, K is always 1 and the kernel is 3×3 pixels
  – −K ≤ i′ ≤ K (kernel row index)
  – −K ≤ j′ ≤ K (kernel column index)
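As a concrete illustration of the conventions above (the sizes chosen here are arbitrary, not VGG's actual ones), the tensors could be laid out in NumPy like this:

```python
import numpy as np

# Arbitrary example sizes, chosen only to illustrate the index conventions.
B, IC, OC, H, W, K = 2, 3, 4, 5, 5, 1

x = np.zeros((B, IC, H, W))                    # x(b, ic, i, j): input batch
y = np.zeros((B, OC, H, W))                    # y(b, oc, i, j): output batch
w = np.zeros((OC, IC, 2 * K + 1, 2 * K + 1))   # w(oc, ic, i', j'): 3x3 kernels when K = 1
```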

Multidimensional Data

• x(b, ic, i, j): a batch of images input to a layer
• y(b, oc, i, j): a batch of images output from a layer
• w(oc, ic, i′, j′): filters (kernels) applied to each image

4 Convolution2D

Description: Convolution takes a batch of images (x) and a filter (w) and outputs another batch of images (y). An input batch x consists of B images, each of which consists of IC channels, each of which consists of (H × W) pixels. A filter is essentially a small image. It consists of OC output channels, each of which consists of IC input channels, each of which consists of (2K + 1) × (2K + 1) pixels. An output batch consists of B images, each of which consists of OC channels, each of which consists of (H × W) pixels. Each pixel in the output is obtained by taking the inner product of the filter and the patch of the input around the corresponding pixel; x(b, ic, i, j) denotes the pixel value at row i and column j of the bth image's icth channel.

Forward:

  y(b, oc, i, j) = ∑_{0 ≤ ic < IC, −K ≤ i′ ≤ K, −K ≤ j′ ≤ K} w(oc, ic, i′, j′) x(b, ic, i + i′, j + j′)   (1)

The actual code must take care of array index underflow and overflow. In the expression above, we assume all elements whose indices underflow or overflow are zero.

Backward:

  ∂L/∂x(b, ic, i + i′, j + j′)
    = ∑_{b′, oc, i, j} ∂L/∂y(b′, oc, i, j) · ∂y(b′, oc, i, j)/∂x(b, ic, i + i′, j + j′)   (2)
    = ∑_{oc, i, j} ∂L/∂y(b, oc, i, j) · ∂y(b, oc, i, j)/∂x(b, ic, i + i′, j + j′)   (3)
    = ∑_{oc, i, j} ∂L/∂y(b, oc, i, j) w(oc, ic, i′, j′)   (4)

Equivalently, let i″ = i + i′ and j″ = j + j′, so that

  0 ≤ i = i″ − i′ < H   (5)
  0 ≤ j = j″ − j′ < W   (6)

  ∂L/∂x(b, ic, i″, j″) = ∑_{oc, i″ − H < i′ ≤ i″, j″ − W < j′ ≤ j″} ∂L/∂y(b, oc, i″ − i′, j″ − j′) w(oc, ic, i′, j′)   (7)

Replacing i″ with i and j″ with j for readability, we get

  ∂L/∂x(b, ic, i, j) = ∑_{oc, i − H < i′ ≤ i, j − W < j′ ≤ j} ∂L/∂y(b, oc, i − i′, j − j′) w(oc, ic, i′, j′)   (8)
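Equation (1), with the zero-for-out-of-range convention, can be sketched in NumPy as follows. This is a naive reference implementation (function name and loop structure are mine, not the actual code this note accompanies); the out-of-range reads are handled by zero-padding the input by K on each side.

```python
import numpy as np

def conv2d_forward(x, w, K=1):
    """Naive Convolution2D following Eq. (1).

    Elements whose indices i + i' or j + j' fall outside the image are
    treated as zero, implemented by zero-padding the input by K pixels."""
    B, IC, H, W = x.shape
    OC = w.shape[0]
    xp = np.pad(x, ((0, 0), (0, 0), (K, K), (K, K)))
    y = np.zeros((B, OC, H, W))
    for i in range(H):
        for j in range(W):
            # patch(b, ic, i', j') = x(b, ic, i + i', j + j') for -K <= i', j' <= K
            patch = xp[:, :, i:i + 2 * K + 1, j:j + 2 * K + 1]
            # inner product over (ic, i', j') for every (b, oc) pair
            y[:, :, i, j] = np.einsum('bcuv,ocuv->bo', patch, w)
    return y
```

The backward passes (8) and (11) are the same kind of loop, with ∂L/∂y taking the place of x or w in the inner product.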

  ∂L/∂w(oc, ic, i′, j′)
    = ∑_{b, oc′, i, j} ∂L/∂y(b, oc′, i, j) · ∂y(b, oc′, i, j)/∂w(oc, ic, i′, j′)   (9)
    = ∑_{b, i, j} ∂L/∂y(b, oc, i, j) · ∂y(b, oc, i, j)/∂w(oc, ic, i′, j′)   (10)
    = ∑_{b, i, j} ∂L/∂y(b, oc, i, j) x(b, ic, i + i′, j + j′)   (11)

5 Linear4D

Forward:

  y(b, c, 0, 0) = ∑_{ic} x(b, ic, 0, 0) w(ic, c)   (12)

Backward:

  ∂L/∂x(b, ic, 0, 0)
    = ∑_{b′, c} ∂L/∂y(b′, c, 0, 0) · ∂y(b′, c, 0, 0)/∂x(b, ic, 0, 0)   (13)
    = ∑_{c} ∂L/∂y(b, c, 0, 0) w(ic, c)   (14)

  ∂L/∂w(ic, c)
    = ∑_{b, c′} ∂L/∂y(b, c′, 0, 0) · ∂y(b, c′, 0, 0)/∂w(ic, c)   (15)
    = ∑_{b} ∂L/∂y(b, c, 0, 0) x(b, ic, 0, 0)   (16)

6 Dropout4

Forward:

  y(b, c, i, j) = R(b, c, i, j) x(b, c, i, j)   (17)

where R(b, c, i, j) is a random matrix whose element is 0 with probability p and 1/(1 − p) with probability (1 − p).

Backward:

  ∂L/∂x(b, c, i, j) = ∂L/∂y(b, c, i, j) R(b, c, i, j)   (18)

7 BatchNormalization4

Forward:

  µ(ic) = (1/BHW) ∑_{b, i, j} x(b, ic, i, j)   (19)
  σ²(ic) = (1/BHW) ∑_{b, i, j} (x(b, ic, i, j) − µ(ic))²   (20)
  x̂(b, ic, i, j) = (x(b, ic, i, j) − µ(ic)) / √(σ²(ic) + ϵ)   (21)
  y(b, ic, i, j) = γ(ic) x̂(b, ic, i, j) + β(ic)   (22)
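Equations (19)–(22) map almost one-to-one onto NumPy reductions over the (b, i, j) axes. A minimal sketch (the function name and the choice of returned intermediates are mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BatchNormalization forward, Eqs. (19)-(22): statistics are taken
    per channel ic, over the batch and spatial axes (b, i, j)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)                  # Eq. (19)
    var = ((x - mu) ** 2).mean(axis=(0, 2, 3), keepdims=True)   # Eq. (20)
    xhat = (x - mu) / np.sqrt(var + eps)                        # Eq. (21)
    y = gamma[None, :, None, None] * xhat + beta[None, :, None, None]  # Eq. (22)
    return y, xhat, var  # xhat and var are reused by the backward pass
```

With γ = 1 and β = 0, each output channel has (approximately, because of ϵ) zero mean and unit variance over (b, i, j).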

Backward:

  ∂L/∂γ(ic)
    = ∑_{b, i, j} ∂L/∂y(b, ic, i, j) · ∂y(b, ic, i, j)/∂γ(ic)   (23)
    = ∑_{b, i, j} ∂L/∂y(b, ic, i, j) · (x(b, ic, i, j) − µ(ic)) / √(σ²(ic) + ϵ)   (24)

  ∂L/∂β(ic) = ∑_{b, i, j} ∂L/∂y(b, ic, i, j)   (25)

  ∂L/∂x̂(b, ic, i, j)
    = ∂L/∂y(b, ic, i, j) · ∂y(b, ic, i, j)/∂x̂(b, ic, i, j)   (26)
    = ∂L/∂y(b, ic, i, j) γ(ic)   (27)

  ∂L/∂σ²(ic)
    = ∑_{b, i, j} ∂L/∂x̂(b, ic, i, j) · ∂x̂(b, ic, i, j)/∂σ²(ic)   (28)
    = ∑_{b, i, j} ∂L/∂x̂(b, ic, i, j) · (−1/2) (x(b, ic, i, j) − µ(ic)) / (σ²(ic) + ϵ)^{3/2}   (29)

  ∂L/∂µ(ic)
    = ∑_{b, i, j} ∂L/∂x̂(b, ic, i, j) · ∂x̂(b, ic, i, j)/∂µ(ic) + ∂L/∂σ²(ic) · ∂σ²(ic)/∂µ(ic)   (30)
    = − ∑_{b, i, j} ∂L/∂x̂(b, ic, i, j) · 1/√(σ²(ic) + ϵ) + ∂L/∂σ²(ic) · (2/BHW) ∑_{b, i, j} (µ(ic) − x(b, ic, i, j))   (31)
    = − ∑_{b, i, j} ∂L/∂x̂(b, ic, i, j) · 1/√(σ²(ic) + ϵ)   (32)

(the second term of (31) vanishes because ∑_{b, i, j} (µ(ic) − x(b, ic, i, j)) = 0 by the definition (19) of µ(ic)).

  ∂L/∂x(b, ic, i, j)
    = ∂L/∂x̂(b, ic, i, j) · ∂x̂(b, ic, i, j)/∂x(b, ic, i, j) + ∂L/∂σ²(ic) · ∂σ²(ic)/∂x(b, ic, i, j) + ∂L/∂µ(ic) · ∂µ(ic)/∂x(b, ic, i, j)   (33)
    = ∂L/∂x̂(b, ic, i, j) · 1/√(σ²(ic) + ϵ) + ∂L/∂σ²(ic) · (2/BHW) (x(b, ic, i, j) − µ(ic)) + ∂L/∂µ(ic) · 1/BHW   (34)
    = ∂L/∂y(b, ic, i, j) γ(ic)/√(σ²(ic) + ϵ)   (35)
      + [∑_{b, i, j} (−1/2) ∂L/∂x̂(b, ic, i, j) (x(b, ic, i, j) − µ(ic)) / (σ²(ic) + ϵ)^{3/2}] · (2/BHW) (x(b, ic, i, j) − µ(ic))   (36)
      + [− ∑_{b, i, j} ∂L/∂x̂(b, ic, i, j) · 1/√(σ²(ic) + ϵ)] · 1/BHW   (37)
    = ∂L/∂y(b, ic, i, j) γ(ic)/√(σ²(ic) + ϵ)   (38)
      − γ(ic)/((σ²(ic) + ϵ)^{3/2} BHW) [∑_{b, i, j} ∂L/∂y(b, ic, i, j) (x(b, ic, i, j) − µ(ic))] (x(b, ic, i, j) − µ(ic))   (39)

      − γ(ic)/(BHW √(σ²(ic) + ϵ)) ∑_{b, i, j} ∂L/∂y(b, ic, i, j)   (40)
    = ∂L/∂y(b, ic, i, j) γ(ic)/√(σ²(ic) + ϵ)   (41)
      − (1/BHW) γ(ic)/√(σ²(ic) + ϵ) [∑_{b, i, j} ∂L/∂y(b, ic, i, j) (x(b, ic, i, j) − µ(ic)) / (σ²(ic) + ϵ)] (x(b, ic, i, j) − µ(ic))   (42)
      − (1/BHW) γ(ic)/√(σ²(ic) + ϵ) ∑_{b, i, j} ∂L/∂y(b, ic, i, j)   (43)
    = ∂L/∂y(b, ic, i, j) γ(ic)/√(σ²(ic) + ϵ)   (44)
      − (1/BHW) γ(ic)/√(σ²(ic) + ϵ) [∑_{b, i, j} ∂L/∂y(b, ic, i, j) (x(b, ic, i, j) − µ(ic)) / √(σ²(ic) + ϵ)] (x(b, ic, i, j) − µ(ic)) / √(σ²(ic) + ϵ)   (45)
      − (1/BHW) γ(ic)/√(σ²(ic) + ϵ) ∑_{b, i, j} ∂L/∂y(b, ic, i, j)   (46)
    = ∂L/∂y(b, ic, i, j) γ(ic)/√(σ²(ic) + ϵ)   (47)
      − (1/BHW) γ(ic)/√(σ²(ic) + ϵ) · ∂L/∂γ(ic) · x̂(b, ic, i, j)   (48)
      − (1/BHW) γ(ic)/√(σ²(ic) + ϵ) · ∂L/∂β(ic)   (49)
    = γ(ic)/√(σ²(ic) + ϵ) · (∂L/∂y(b, ic, i, j) − (1/BHW) (∂L/∂γ(ic) x̂(b, ic, i, j) + ∂L/∂β(ic)))   (50)

8 Relu4

Forward:

  y(b, c, i, j) = max(0, x(b, c, i, j))   (51)
              = { x(b, c, i, j)   (x(b, c, i, j) ≥ 0)
                { 0               (x(b, c, i, j) < 0)   (52)

Backward:

  ∂L/∂x(b, c, i, j) = { ∂L/∂y(b, c, i, j)   (x(b, c, i, j) ≥ 0)
                      { 0                   (otherwise)   (53)

9 MaxPooling2d

Forward:

  y(b, c, i, j) = max_{Si ≤ i′ < S(i + 1), Sj ≤ j′ < S(j + 1)} x(b, c, i′, j′)   (54)

where S is the size (and stride) of the pooling window.

Backward:

  ∂L/∂x(b, c, i, j) = ∑_{b′, c′, i′, j′} ∂L/∂y(b′, c′, i′, j′) · ∂y(b′, c′, i′, j′)/∂x(b, c, i, j)   (55)
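Returning to Section 7: the closed form (50), together with (24) and (25) for ∂L/∂γ and ∂L/∂β, gives a compact BatchNormalization backward pass. A minimal NumPy sketch (assuming the forward pass saved x̂ from Eq. (21) and σ² from Eq. (20), with σ² kept in shape (1, IC, 1, 1); the function name is mine):

```python
import numpy as np

def batchnorm_backward(dy, xhat, var, gamma, eps=1e-5):
    """BatchNormalization backward via the closed form of Eq. (50).

    dy is dL/dy; xhat (Eq. (21)) and var (Eq. (20), shape (1, IC, 1, 1))
    are saved by the forward pass. Returns dL/dx, dL/dgamma, dL/dbeta."""
    B, IC, H, W = dy.shape
    BHW = B * H * W
    dgamma = (dy * xhat).sum(axis=(0, 2, 3))   # Eq. (24)
    dbeta = dy.sum(axis=(0, 2, 3))             # Eq. (25)
    dx = (gamma[None, :, None, None] / np.sqrt(var + eps)) * (
        dy - (dgamma[None, :, None, None] * xhat
              + dbeta[None, :, None, None]) / BHW)  # Eq. (50)
    return dx, dgamma, dbeta
```

One consequence of (50) worth noticing: since ∑ x̂ = 0 per channel, the gradient dL/dx also sums to zero over (b, i, j) for each channel.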
