Introduction to Convolutional Neural Networks

4. Convolutional Neural Networks Part 1: Filters

So how are Convolutional Neural Networks (CNNs) more effective than traditional neural networks for image recognition?

The general idea of CNNs is to replicate the process of breaking down an image that we discussed in the previous lesson.

That means CNNs have to be able to recognize the positions of pixels relative to each other, and to handle the fact that objects such as cars or ducks can appear anywhere in an image.

They do this starting with a filter.

A filter is a small grid of values whose size is chosen by the user (3x3, 4x4, etc.). It moves across an image, starting from the top left and ending at the bottom right.

At each position, a single value is calculated by multiplying each filter value by the pixel underneath it and adding up the results. This operation is called a convolution.
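To make the operation concrete, here is a minimal sketch in plain Python. The function name `convolve2d`, the tiny image, and the vertical-edge filter are all made up for illustration, and the sketch assumes a stride of 1 with no padding. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for top in range(out_rows):          # move down the image
        row = []
        for left in range(out_cols):     # move left to right
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    # multiply each filter value by the pixel underneath it
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Hypothetical 3x3 filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The output is 0 where the filter sits on a flat dark region and 3 wherever it overlaps the edge.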

A convolution can be hard to visualize, so the image below may be helpful:

Each filter is meant to detect one feature of an image. With the previous car example, one filter could learn what a headlight looks like, another could learn what a wheel looks like, and so on.

Because the same small filter is reused at every position, a CNN does not need a separate weight for each of the thousands or millions of pixel inputs the way a traditional neural network does. This also solves the location issue: since the filters move across the whole image, the location of an object no longer matters as long as its general shape is maintained.

Filters are usually initialized to random values when building a CNN and are then continuously updated as the network is trained (just like the weights of a traditional neural network).
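Both points can be checked with a rough sketch. The sizes below (a 100x100 grayscale image, 100 hidden units, ten 3x3 filters) and the helper names `max_response`, `blank`, and `stamp` are assumptions picked for illustration, not values from this lesson.

```python
# Part 1: how many weights does each approach need? (biases ignored)
image_pixels = 100 * 100                      # assumed 100x100 grayscale image
hidden_units = 100                            # assumed size of one dense layer
dense_weights = image_pixels * hidden_units   # every pixel connects to every unit

num_filters = 10                              # assumed number of 3x3 filters
conv_weights = num_filters * 3 * 3            # each filter is reused everywhere

print(dense_weights, conv_weights)            # 1000000 90


# Part 2: the same filter finds the same shape wherever it is.
def max_response(image, kernel):
    """Largest value the kernel produces anywhere in the image (stride 1, no padding)."""
    k = len(kernel)
    best = None
    for top in range(len(image) - k + 1):
        for left in range(len(image[0]) - k + 1):
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            best = total if best is None else max(best, total)
    return best


def blank(size):
    """Return a size x size image filled with zeros."""
    return [[0] * size for _ in range(size)]


def stamp(image, pattern, top, left):
    """Copy `pattern` into `image` with its top-left corner at (top, left)."""
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


pattern = [[1, 0], [0, 1]]     # hypothetical 2x2 "feature"
kernel = [[1, -1], [-1, 1]]    # hypothetical filter tuned to that feature

top_left = stamp(blank(6), pattern, 0, 0)
bottom_right = stamp(blank(6), pattern, 4, 4)

response_a = max_response(top_left, kernel)
response_b = max_response(bottom_right, kernel)
print(response_a, response_b)  # 2 2
```

The strongest response is the same whether the pattern sits in the top-left or the bottom-right corner, because the filter visits every position.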
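As a very simplified sketch of that idea, the code below starts a 3x3 filter at random values and nudges it with plain gradient descent so that its response to one patch moves toward a target. The patch, target, learning rate, and squared-error loss are all assumptions for illustration. A real CNN updates its filters with backpropagation over many images and every filter position, which this sketch does not attempt.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# A 3x3 filter starts out as random values.
kernel = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]

# One hypothetical 3x3 patch of pixels and the response we want from it.
patch = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
target = 1.0
learning_rate = 0.05


def response(kernel, patch):
    """Convolution value at a single position: sum of elementwise products."""
    return sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))


def squared_error(kernel):
    return (response(kernel, patch) - target) ** 2


loss_before = squared_error(kernel)

for _ in range(20):
    error = response(kernel, patch) - target
    for i in range(3):
        for j in range(3):
            # gradient of the squared error for this filter value: 2 * error * pixel
            kernel[i][j] -= learning_rate * 2 * error * patch[i][j]

loss_after = squared_error(kernel)
print(loss_after < loss_before)  # True
```

Only the filter values that sit over non-zero pixels change, since the other pixels contribute nothing to the response.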

Filters are also sometimes called kernels.

Here’s another image that illustrates how filters move across an image (note that you can also adjust how many pixels the filter moves at each step with the stride size).

After a filter finishes moving across an image, it has generated a feature map (one per filter). Each feature map is then sent to an activation function to determine whether whatever that filter is meant to detect is present in the image.
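The effect of the stride is easy to work out along one dimension. This small sketch assumes no padding, and the helper names `filter_positions` and `output_size` are made up for illustration.

```python
def filter_positions(image_size, filter_size, stride):
    """Starting pixel of the filter at each step along one dimension (no padding)."""
    return list(range(0, image_size - filter_size + 1, stride))


def output_size(image_size, filter_size, stride):
    """Number of positions the filter visits along one dimension (no padding)."""
    return (image_size - filter_size) // stride + 1


# A 3-pixel-wide filter on a 7-pixel-wide image:
print(filter_positions(7, 3, 1))  # [0, 1, 2, 3, 4]
print(filter_positions(7, 3, 2))  # [0, 2, 4]
print(output_size(7, 3, 1), output_size(7, 3, 2))  # 5 3
```

A larger stride means fewer positions, so the resulting feature map is smaller.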
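The lesson does not say which activation function is used, so the sketch below assumes ReLU, a common choice that keeps positive values and replaces negative ones with zero. The feature map values are made up for illustration.

```python
def relu(feature_map):
    """Apply ReLU to every value: negatives become 0, positives are kept."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical feature map produced by one filter.
feature_map = [
    [-3, 0, 2],
    [5, -1, 4],
]

activated = relu(feature_map)
print(activated)  # [[0, 0, 2], [5, 0, 4]]
```

Positions with large values after the activation are where the filter found a strong match for its feature.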

To summarize filters and convolutional layers:

Pooling layers are sometimes used to reduce the dimensions of an image. They take either the max or the average (the two most commonly used options) of each small section of the image and combine those values into a smaller image.
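Here is a minimal sketch of both kinds of pooling. It assumes non-overlapping 2x2 sections, and the names `pool2d` and `average` as well as the input values are made up for illustration.

```python
def pool2d(image, size, reduce_fn):
    """Split `image` into non-overlapping size x size sections and reduce each to one value."""
    pooled = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            section = [
                image[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(reduce_fn(section))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Hypothetical 4x4 input.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, average)
print(max_pooled)  # [[7, 2], [2, 8]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 5.0]]
```

Either way, the 4x4 input shrinks to 2x2, with each output value standing in for one 2x2 section.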

To summarize pooling layers: