
Convolutional & Pooling Layers

by JK from Korea 2022. 12. 25.


 

Date : 2022.10.22

 

*The contents of this book are heavily based on Stanford University’s CS231n course.

 

[Convolutional Neural Network]

CNNs are widely used in computer vision. A CNN is similar to the network we’ve already built: adding convolutional and pooling layers turns a plain multilayer network into a CNN. The structure looks something like the following.

 

[CNN Simple Ver.]

[The Convolutional Layer]

Unlike the affine layer, the convolutional layer maintains the shape of the data as it is originally given. Let me explain. The affine layer uses ‘flattened’ data, which, in most cases, has a lower dimension than the original data (image data is represented in higher dimensions). If our original image data is 3-dimensional but we flatten it into 1 dimension, we can lose critical spatial patterns.

 

So, the convolutional layer prevents the loss of critical patterns by preserving the data’s shape. Now we get to deal with 3-dimensional matrix multiplication.

 

[Convolutional Multiplication]

 

[Basic Convolutional Operation]

 

[Operations in Convolutional Multiplication]

 

The convolutional layer applies a fused multiply-add (FMA) to the input data. The process uses filters, also called kernels, which move across the image as you see above. The gray areas where the kernel overlaps the input data are called windows.

 

The left matrix is the input data, and the kernel corresponds to the weight parameters; the bias is added afterward.

 

[Input and Weight Multiplication followed by Bias Addition]

 

The bias is always a single 1x1 value, broadcast to every element of the output. These are the basics of conv multiplication, and now I will introduce two more techniques the conv layer needs to successfully deal with 3-dimensional data.
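To make the window arithmetic concrete, here is a minimal NumPy sketch of the basic operation: slide the kernel, fused multiply-add over each window, then add the scalar bias. The function name and the sample values are my own, chosen to mirror the figures above.

```python
import numpy as np

def conv2d(x, w, b, stride=1):
    """Naive 2D convolution: FMA the kernel w over each window of x,
    then add the scalar bias b."""
    H, W = x.shape
    FH, FW = w.shape
    OH = (H - FH) // stride + 1
    OW = (W - FW) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            # the gray 'window' where the kernel overlaps the input
            window = x[i*stride:i*stride+FH, j*stride:j*stride+FW]
            out[i, j] = np.sum(window * w) + b  # multiply, sum, add bias
    return out

x = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
w = np.array([[2, 0, 1],
              [0, 1, 2],
              [1, 0, 2]])
print(conv2d(x, w, b=3))
# [[18. 19.]
#  [ 9. 18.]]
```

A 4x4 input with a 3x3 kernel yields a 2x2 output, which is exactly the size reduction that padding (next) is meant to counteract.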

 

1. Padding

[Add an extra layer of 0’s surrounding the input data]

 

Padding is used to adjust the output size. Because of the kernels, the output size gradually decreases after each conv layer. Thus we add a border of extra 0’s around the input data to prevent the size reduction.
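In NumPy the zero border can be added with `np.pad`; a quick sketch (the example sizes are my own):

```python
import numpy as np

x = np.arange(9).reshape(3, 3)   # a 3x3 input
# one ring of zeros on every side -> padding P = 1
x_pad = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(x_pad.shape)  # (5, 5)
```

With a 3x3 kernel and stride 1, the padded 5x5 input convolves down to 3x3, the same size as the original input, so the size reduction is prevented.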

 

2. Stride

[The unit in which Kernels move]

Stride is the unit for kernel movement.

 

Both padding and stride affect the output size, in different ways: padding and the output size are directly proportional, while stride and the output size are inversely proportional.

[Input Size = (H, W), Filter Size = (FH, FW), Padding = P, Stride = S, Output Size = (OH, OW)]

OH = (H + 2P − FH) / S + 1
OW = (W + 2P − FW) / S + 1

The equation above helps us calculate the output size from the necessary parameters. However, we must make sure that the output width and height both come out as natural numbers.

 

The input so far has been 2-dimensional: height and width. Now we shall add the third dimension, the channel. Each channel has a corresponding kernel slice. The figure below is a more intuitive representation.

 

[3 Dimensional Convolution Multiplication]

Each kernel slice should have the same shape (height, width), and the number of slices must equal the number of input channels.

 

[Convolution Multiplication in Blocks]

Before we dive any further, we need to remind ourselves of the characteristics of the input and output data we want. The input data is an image. Typically, an image is 2-dimensional: height and width. But what about the color? Color is another feature of an image, and we represent it as the third dimension, which is why a single image has height, width, and channel. The conv layer’s job is to preserve the features of the input image while computing the multiply-add with the weight (kernel) and bias parameters. After all, deep learning is simply a million linear functions put together.

 

The figure above shows the calculation for one input image and one weight variable. However, as in the fully connected network we already built, there are multiple input images and multiple weight variables (nodes). In a fully connected network, each input node (data) is connected to all the weight variables. Likewise, every input is conv multiplied with all kernels (weights).
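Putting the batch (N), channels (C), and multiple filters (FN) together, the full forward pass can be sketched with naive loops (a slow but readable sketch; the shapes follow the (N, C, H, W) convention, and the function name is my own):

```python
import numpy as np

def conv_forward(x, w, b, stride=1):
    """x: (N, C, H, W) batch, w: (FN, C, FH, FW) filters, b: (FN,) biases.
    Each filter spans all C input channels; FN filters give FN feature maps."""
    N, C, H, W = x.shape
    FN, C_w, FH, FW = w.shape
    assert C == C_w, "one kernel slice per input channel"
    OH = (H - FH) // stride + 1
    OW = (W - FW) // stride + 1
    out = np.zeros((N, FN, OH, OW))
    for n in range(N):          # every input image...
        for f in range(FN):     # ...is convolved with every filter
            for i in range(OH):
                for j in range(OW):
                    window = x[n, :, i*stride:i*stride+FH, j*stride:j*stride+FW]
                    out[n, f, i, j] = np.sum(window * w[f]) + b[f]
    return out

x = np.random.randn(2, 3, 8, 8)   # 2 images, 3 channels (e.g. RGB)
w = np.random.randn(5, 3, 3, 3)   # 5 filters, each spanning all 3 channels
b = np.zeros(5)
print(conv_forward(x, w, b).shape)  # (2, 5, 6, 6)
```

The channel axis of the output equals the number of filters, not the number of input channels, which is why stacking conv layers lets the channel count grow.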

 

[Feature Mapping for one Image]

 

[N number of input data with Bias]

Now we’re done with the conv layer.

 

[The Pooling Layer]

‘Pool’ing, like a swimming ‘pool’, is a method in which we select a few representative values from a designated ‘pool’ of numbers. Through this process we are able to reduce the input size.

 

[Max Pooling]

The pooling layer is unique in that it does not use any ‘learning’ parameters. No weights, biases, or gradients are involved in this layer. It simply pools the max or average value of each window. The data’s height and width are the only things the pooling layer changes; the number of channels remains constant. Furthermore, pooling is more robust to small shifts in the input data.

 

[The Pooling Selects Representative Values]
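Max pooling picks the largest value in each window, with no learned parameters at all. A minimal sketch on one channel (the sample values and function name are my own):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """2x2 max pooling: keep the largest value in each window.
    No weights or biases -- nothing here is learned."""
    H, W = x.shape
    OH = (H - size) // stride + 1
    OW = (W - size) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = x[i*stride:i*stride+size,
                          j*stride:j*stride+size].max()
    return out

x = np.array([[1, 2, 1, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 4, 0, 1]])
print(max_pool(x))
# [[2. 3.]
#  [4. 2.]]
```

Because only the per-window maximum survives, shifting a value one pixel inside its window often leaves the output unchanged, which is why pooling is robust to small input shifts.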

 
