As an attempt to understand Convolutional Neural Networks (CNN/ConvNet) better, I was advised to read the section about **LeNet-5** in the original paper and figure out where every number comes from 🤔

###### Input layer

The input of this neural network is an image of size `32*32` pixels, where each pixel is represented by an input neuron. Pixel values `0..255` are normalized to `-0.1..1.175` so that the mean is 0 and the variance is roughly 1. The hidden layers (C1 up to F6) use the Hyperbolic Tangent (`tanh`) as activation function.
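As a sketch of that normalization (assuming 8-bit grayscale input; the linear map below is derived from the paper's target range of `-0.1` for background and `1.175` for foreground):

```python
def normalize(pixel):
    """Map an 8-bit pixel value (0..255) linearly onto -0.1..1.175,
    so the input distribution has roughly zero mean and unit variance."""
    return -0.1 + (pixel / 255.0) * (1.175 - (-0.1))

print(normalize(0))    # -0.1
print(normalize(255))  # 1.175
```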

###### Convolution layer 1 (C1)

Local receptive fields of size `5*5` are chosen, so the shared weights (kernel) are also `5*5` for each feature map. Since each kernel has `1` bias and `6` feature maps are required, the number of trainable parameters (weights & biases) is

```
trainable params = (weights * input maps + bias) * feature maps
                 = (5 * 5 * 1 + 1) * 6
                 = 156
```

Since the size of each feature map is `28*28`,

```
connections = (inputs + bias) * feature maps * feature map size
            = (5 * 5 + 1) * 6 * 28 * 28
            = 122304
```

The feature map size `28*28` is a consequence of adjacent receptive fields intentionally overlapping by `4` rows/columns (`32 - 4 = 28`).
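The arithmetic above can be checked with a small helper (the function name and parameters are my own, not from the paper):

```python
def conv_layer_counts(kernel, in_maps, out_maps, map_size):
    """Trainable parameters and connections of a convolutional layer
    where every output map is connected to all `in_maps` input maps."""
    weights_per_map = kernel * kernel * in_maps + 1          # +1 for the bias
    params = weights_per_map * out_maps
    connections = weights_per_map * out_maps * map_size * map_size
    return params, connections

print(conv_layer_counts(kernel=5, in_maps=1, out_maps=6, map_size=28))
# (156, 122304)
```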

###### Subsampling layer 2 (S2)

In this layer the kernel size is `2*2` and weights are shared. The differences from C1 are that there is no pixel overlapping, and only `1` weight and `1` bias are used per feature map. Since the output of this layer is `6` feature maps (the same as the input maps),

```
trainable params = (weight + bias) * feature maps
                 = (1 + 1) * 6
                 = 12
```

Since the size of each feature map is `14*14`,

```
connections = (inputs + bias) * feature maps * feature map size
            = (2 * 2 + 1) * 6 * 14 * 14
            = 5880
```
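A similar sketch for the subsampling layers (one shared coefficient plus one bias per map; again, the naming is mine):

```python
def subsample_layer_counts(kernel, maps, map_size):
    """Trainable parameters and connections of a LeNet-5 subsampling layer:
    each map has a single shared weight and a single bias."""
    params = (1 + 1) * maps                                   # 1 weight + 1 bias per map
    connections = (kernel * kernel + 1) * maps * map_size * map_size
    return params, connections

print(subsample_layer_counts(kernel=2, maps=6, map_size=14))
# (12, 5880)
```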

###### Convolutional layer 3 (C3)

The C3 layer is similar to C1 except that there is more than one input map, and each (output) feature map is connected to a different subset of input maps. This is the arrangement of the `16` feature maps of size `10*10`:

- The first `6` feature maps are connected to `3` contiguous input maps each (overlapping 2 maps)
- The second `6` feature maps are connected to `4` contiguous input maps each (overlapping 3 maps)
- The next `3` feature maps are connected to `4` discontinuous input maps each (overlapping 1 map)
- The last `1` feature map is connected to all `6` input maps

Hence,

```
trainable params = (weights * input maps + bias) * feature maps
1st group  = (5 * 5 * 3 + 1) * 6 = 456
2nd group  = (5 * 5 * 4 + 1) * 6 = 606
3rd group  = (5 * 5 * 4 + 1) * 3 = 303
4th group  = (5 * 5 * 6 + 1) * 1 = 151
all groups = 456 + 606 + 303 + 151 = 1516
```

then,

```
connections = (inputs + bias) * feature maps * feature map size
            = trainable params * feature map size
            = 1516 * 10 * 10
            = 151600
```
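The grouped connectivity can be tallied per group. Each pair below is (input maps per feature map, number of feature maps in the group), matching the four groups listed above; the helper name is mine:

```python
def c3_counts(kernel=5, map_size=10):
    """Parameters and connections for C3's four connection groups."""
    groups = [(3, 6), (4, 6), (4, 3), (6, 1)]
    params = sum((kernel * kernel * in_maps + 1) * n_maps
                 for in_maps, n_maps in groups)
    # Every parameter is reused at every position of the 10*10 feature map.
    connections = params * map_size * map_size
    return params, connections

print(c3_counts())  # (1516, 151600)
```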

###### Subsampling layer 4 (S4)

Similar to S2 except that the number of feature maps is `16` (the same as the input maps), and each of them is `5*5` pixels. Hence,

```
trainable params = (weight + bias) * feature maps
                 = (1 + 1) * 16
                 = 32
```

and,

```
connections = (inputs + bias) * feature maps * feature map size
            = (2 * 2 + 1) * 16 * 5 * 5
            = 2000
```

###### Convolution layer 5 (C5)

The last convolution layer is similar to C3 except that the number of feature maps is `120` and each of them is connected to all input maps. Since the input maps are `5*5` and the kernel is also `5*5`, each feature map is just `1*1`. Hence,

```
trainable params = (weights * input maps + bias) * feature maps
                 = (5 * 5 * 16 + 1) * 120
                 = 48120
```

then,

```
connections = (inputs + bias) * feature maps * feature map size
            = trainable params * feature map size
            = 48120 * 1 * 1
            = 48120
```

###### Fully-connected layer (F6)

This layer is just a simple fully-connected layer with `84` output neurons. Hence,

```
trainable params = connections = (inputs + bias) * outputs
                               = (120 + 1) * 84
                               = 10164
```

###### Output layer

Finally, the output layer consists of `10` Euclidean Radial Basis Function (RBF) units, matching the number of classes.
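Putting it all together, the per-layer counts derived above can be summed as a sanity check (the RBF units' parameters are fixed rather than learned in the paper, so I leave them out here):

```python
# Trainable parameter counts per layer, as derived in the sections above.
layer_params = {
    "C1": 156,
    "S2": 12,
    "C3": 1516,
    "S4": 32,
    "C5": 48120,
    "F6": 10164,
}

total = sum(layer_params.values())
print(total)  # 60000
```

This matches the roughly 60k trainable parameters commonly quoted for LeNet-5.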

And we are done! 🍻