Introduction to Convolutional Neural Networks


In the last unit, we learned about the fundamentals of vision, images, and Computer Vision. We also explored visual features as a crucial part of analyzing images with the help of computers.

The approaches we discussed are today often referred to as “classical” Computer Vision. While they work well on many small, constrained datasets and settings, classical methods have limits that come to light on larger, real-world datasets.

In this unit, we will learn about Convolutional Neural Networks, an important step forward in terms of scale and performance of Computer Vision.

Convolution: Basic Ideas

Convolution is an operation used to extract features from data. The data can be 1D, 2D, or 3D. We'll explain the operation with a concrete example. All you need to know for now is that the operation takes a small matrix of numbers, moves it across the data, and at each position takes the sum of products between the data and that matrix. This matrix is called a kernel or filter. You might ask, “What does this have to do with feature extraction, and how am I supposed to apply it?” Don't panic! We're getting to it.

To illustrate the intuition, let's take a look at this example. We have some 1D data, and we visualize it. The visualization will help us understand the effect of the convolution operation.

Kernel Image

We have the kernel [-1, 1]. We'll start from the left-most element, place the kernel over the data, multiply the overlapping numbers, and sum them up. Every kernel has a center element; here, we pick the center as 1 (the element on the right). Because the center has to touch every single element of the data, we put an imaginary zero to the left of the first element; this added zero is called a pad, and you will see more of it later on. If we didn't pad, -1 would sit on the left-most element and 1 would never touch it, so we apply padding. Let's see what it looks like.

1D Conv

We multiply the left-most element (currently a pad) by -1 and the first element (a zero) by 1, sum them up to get 0, and note it down. Then we move the kernel by one position and do the same, noting the result again. This movement is called striding; it is usually done by moving the kernel one element at a time, but you can also move it by more. The result so far (the convolved data) is the array [0, 0].

1D Conv Multiplication

We repeat this until the kernel's center (the right element) has touched every element, which yields the result below.

1D Conv Result

Notice anything? The filter gives the rate of change in the data (the derivatives!). This is one characteristic we could extract from our data. Let’s visualize it.

Convolved Illustrated

The convolved data (the result of the convolution) is called a feature map. The name makes sense: it shows a feature we can extract, a characteristic of the data, here the rate of change.
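To make the 1D walkthrough concrete, here is a minimal NumPy sketch of the same operation. The data values are hypothetical, but the [-1, 1] kernel, the single zero pad, and the stride of one match the steps above.

import numpy as np

# Hypothetical 1D data and the [-1, 1] kernel from the walkthrough above.
data = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])
kernel = np.array([-1, 1])

# Pad a single zero on the left so the kernel's center (its right element)
# can touch every element of the data.
padded = np.concatenate([[0], data])

# Slide the kernel one position at a time (stride 1), multiply the overlapping
# numbers, and sum them up.
feature_map = np.array(
    [np.sum(padded[i : i + len(kernel)] * kernel) for i in range(len(data))]
)
print(feature_map)  # [ 0  0  0  1  0  0 -1  0  0] -> non-zero where the data changes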

This is exactly what edge detection filters do! Let's see it on 2-dimensional data. This time, our kernel will be different: a 3x3 kernel (just so you know, it could have been 2x2 too).

2D Conv

This filter is actually quite famous, but we won't spoil it for you yet :). The previous filter was [-1, 1]; each row of this one is [-1, 0, 1], stacked three times into a 3x3 matrix, so it captures increases and decreases along the horizontal axis. Let's see an example and apply the convolution. Below is our 2D data.

2D Conv

Think of this as an image from which we want to extract the horizontal changes. The center of the filter has to touch every single pixel, so we pad the image.

Padding

The feature map will be the same size as the original data. Each result of the convolution is written to the position that the center of the kernel touches in the original matrix; for the first position, that is the top-left pixel.

2D Conv

If we keep applying the convolution, we get the following feature map.

2D Feature Map

This feature map shows us the horizontal changes (the edges). This filter is actually called the Prewitt filter.

Prewitt and Sobel

You can transpose the Prewitt filter to get the changes in the vertical direction. The Sobel filter is another famous edge detection filter.
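Here is a short sketch of the 2D case, applying the Prewitt kernels to a small hypothetical image with SciPy. scipy.ndimage.correlate performs exactly the sliding sum-of-products described above, and mode="constant" gives the zero padding.

import numpy as np
from scipy.ndimage import correlate

# Hypothetical 6x6 "image" with a vertical edge in the middle.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Prewitt kernel for horizontal changes; its transpose detects vertical changes.
prewitt_x = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]])
prewitt_y = prewitt_x.T

# correlate slides the kernel over the image and takes the sum of products,
# with zero padding so the feature map keeps the input size.
edges_x = correlate(image, prewitt_x, mode="constant", cval=0.0)
edges_y = correlate(image, prewitt_y, mode="constant", cval=0.0)

print(edges_x)  # non-zero only around the columns where the pixel values change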

Convolutional Neural Networks

Fine, but what does this have to do with deep learning? Brute-forcing filters to extract features does not work well for every image. Imagine if we could somehow find the optimal filters to extract important information, or even detect objects, in images. That's where convolutional neural networks come into play. We convolve images with various filters, the elements of those kernels become the parameters that we optimize, and in the end we find the best filters for our problem.

The idea is that we will use filters to extract information. We will randomly initialize multiple filters, create our feature maps, feed them to a classifier, and do backpropagation. Before diving into it, we'd like to introduce something called “pooling”.

As you can see above, there are many pixels that show the change in the feature map. To know that there’s an edge, we only need to see that there’s a change (an edge, a corner, anything), and that’s it.

Pooling

In the above example, we could have kept only one of the two values, and that would be enough. This way, we store fewer parameters and still keep the features. This operation of taking the most important element in a region of the feature map is called pooling. With pooling, we lose the exact pixel location of an edge, but we store fewer parameters. It also makes our feature extraction more robust to small changes: to know that there's a face in an image, we only need to know that there are two eyes, a nose, and a mouth; the distance between these elements and their sizes vary from face to face, and pooling makes the model more robust against such variation. Another advantage of pooling is that it helps us handle varying input sizes. We'd like you to watch this video to gain a better intuition. Below is the max pooling operation: from every 2x2 block of four pixels, we keep only the maximum. There are various other types of pooling, e.g., average pooling, weighted pooling, or L2 pooling.
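For instance, here is a tiny NumPy sketch of 2x2 max pooling on a hypothetical 4x4 feature map; every non-overlapping block of four pixels is replaced by its maximum.

import numpy as np

# Hypothetical 4x4 feature map.
feature_map = np.array([[1, 3, 0, 2],
                        [4, 2, 1, 0],
                        [0, 1, 5, 6],
                        [2, 2, 7, 1]])

# Group the map into 2x2 blocks and keep only the maximum of each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 2]
#  [2 7]]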

Let's build a simple CNN architecture. We will use a Keras example (for the sake of illustration) and walk you through what's happening. Below is our model (again, don't panic).

If you don't know what the Keras Sequential API does: it stacks layers like Lego bricks and connects them. Each layer has different hyperparameters: the Conv2D layer takes the number of convolution filters, the kernel size, and an activation function; MaxPooling2D takes the pooling size; and the Dense layer takes the number of output units (again, don't panic).

Most convnet implementations don't pad the input to let the kernel touch every pixel the way classical image processing does. Padding with zeros assumes that there might be features at the borders, and it adds extra computation on top. That's why you see the first output size as (26, 26): with a 3x3 kernel and no padding, we lose a pixel along each border.

# The usual Keras imports, with the input shape and number of classes chosen
# to match the summary below (28x28 grayscale images, 10 classes).
from tensorflow import keras
from tensorflow.keras import layers

input_shape = (28, 28, 1)
num_classes = 10

model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                16010     
=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________

Convolutional neural networks start with an input layer followed by a convolutional layer. Keras Conv2D layers take the number of kernels and the kernel size as parameters. What happens is illustrated below: we convolve the image with 32 kernels and end up with 32 feature maps, each roughly the size of the image (26x26 here, since we don't pad).

Network
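As a quick sanity-check sketch (assuming a 28x28 grayscale input, which matches the summary above), we can pass a dummy batch through a single Conv2D layer and confirm we get 32 feature maps of size 26x26.

import numpy as np
from tensorflow.keras import layers

# One random 28x28 grayscale "image" (batch dimension first).
dummy_images = np.random.rand(1, 28, 28, 1).astype("float32")

# 32 kernels of size 3x3, no padding: each feature map shrinks to 26x26.
feature_maps = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(dummy_images)
print(feature_maps.shape)  # (1, 26, 26, 32)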

After the convolutional layer, we add a max pooling layer to reduce the number of parameters stored and calculated, and to make the model robust to small changes, as discussed above.

Network

Then, these feature maps are concatenated together and flattened.

Network

Later on, we use something called dropout, which randomly drops a portion of the units during training to avoid overfitting. Finally, the flattened features go through a dense layer that performs the classification, and backpropagation takes place.

Backpropagation in Convolutional Neural Networks in Theory

How does backpropagation work here? We want to optimize for the best kernel values, so they are our weights. In the end, we expect the classifier to figure out the relationship between pixel values, kernels, and classes: we have a long flattened array whose elements are pooled and activated versions of the pixels convolved with the initial weights (the kernel elements), and we update those weights to answer the question “which kernels should I apply to distinguish a cat photo from a dog photo?”. The point of training CNNs is to come up with the optimal kernels, and these are found using backpropagation. Prior to CNNs, people would hand-pick and try many filters on an image to extract features themselves, but generic filters (as we've seen above, e.g., Prewitt or Sobel) do not necessarily work for all images, since images can be very different even within the same dataset. This is why CNNs outperform traditional image processing techniques.
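As a minimal training sketch, assuming the model defined above and MNIST-like data already loaded into hypothetical x_train (shape (num_samples, 28, 28, 1)) and one-hot y_train arrays, the optimization of the kernels looks like this:

# Cross-entropy loss compares the softmax output to the labels; the Adam optimizer
# updates every kernel element (and dense weight) using gradients from backpropagation.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Each batch runs a forward pass, computes the loss, and backpropagates the error
# through the dense, pooling, and convolutional layers to update the kernels.
model.fit(x_train, y_train, batch_size=128, epochs=5, validation_split=0.1)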

There are also a couple of advantages in terms of storage when we use convolutional neural networks.

Parameter sharing

In convolutional neural networks, we convolve with the same filters across all positions in the image, all channels, and all images, which lets us store far fewer parameters; this is much more efficient than going through an image with a dense neural network. This is called “weight tying”, and those weights are called “tied weights”. The same idea also appears in autoencoders.
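To see how much storage this saves, here is a small comparison sketch: a Conv2D layer with 32 3x3 kernels versus a Dense layer with 32 units, both applied to an assumed 28x28 grayscale input.

from tensorflow import keras
from tensorflow.keras import layers

# The convolutional layer reuses the same 32 small kernels at every position in the image...
conv = keras.Sequential([keras.Input(shape=(28, 28, 1)), layers.Conv2D(32, (3, 3))])
# ...while the dense layer needs a separate weight for every one of the 784 input pixels.
dense = keras.Sequential([keras.Input(shape=(28, 28, 1)), layers.Flatten(), layers.Dense(32)])

print(conv.count_params())   # 320    = 32 * (3*3 weights + 1 bias)
print(dense.count_params())  # 25120  = 784 * 32 weights + 32 biases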

Sparse Interactions

In densely connected neural networks, we feed in the whole piece of data at once, which is overwhelming given that images have hundreds of thousands of pixels; in convnets, we use small kernels to extract features, so each output value depends only on a small neighborhood of the input. This is called sparse interaction, and it helps us use less memory.
