
Densely connected convolutional networks (DenseNet)

Figure 2.4.1: A 4-layer Dense block in DenseNet. The input to each layer is made of all the previous feature maps.
DenseNet attacks the problem of vanishing gradients using a different approach. Instead of using shortcut connections, all the previous feature maps become the input of the next layer. The preceding figure shows an example of a dense interconnection in one Dense block.
For simplicity, in this figure we'll only show four layers. Notice that the input to layer l is the concatenation of all the previous feature maps. If we designate BN-ReLU-Conv2D as the operation H(x), then the output of layer l is:

x_l = H([x_0, x_1, x_2, ..., x_{l-1}]) (Equation 2.4.1)
Conv2D uses a kernel of size 3. The number of feature maps generated per layer is called the growth rate, k. Normally, k = 12, but k = 24 is also used in the paper Densely Connected Convolutional Networks by Huang et al., 2017 [5]. Therefore, if the number of input feature maps is k_0, then the total number of feature maps at the end of the 4-layer Dense block in Figure 2.4.1 will be 4 × k + k_0.
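To see how the concatenation accumulates feature maps, here is a minimal Keras sketch of a 4-layer Dense block built from the H(x) operation. The input shape and the 24 input feature maps are illustrative choices, not part of the original listing:

from keras.layers import Input, BatchNormalization, Activation, Conv2D
from keras.layers import concatenate

k = 12                              # growth rate: feature maps added per layer
inputs = Input(shape=(32, 32, 24))  # illustrative: k_0 = 24 input feature maps

x = inputs
for _ in range(4):                  # 4-layer Dense block as in Figure 2.4.1
    # H(x): BN-ReLU-Conv2D producing k new feature maps
    y = BatchNormalization()(x)
    y = Activation('relu')(y)
    y = Conv2D(k, kernel_size=3, padding='same')(y)
    # dense connection: the new maps are concatenated with all previous ones
    x = concatenate([x, y])

# x now carries 4 * k + k_0 = 4 * 12 + 24 = 72 feature maps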
DenseNet also recommends that the Dense block be preceded by BN-ReLU-Conv2D, with the number of feature maps equal to twice the growth rate, k_0 = 2 × k = 24. Therefore, at the end of the Dense block, the total number of feature maps will be 4 × 12 + 24 = 72. We'll also use the same kernel size, which is 3. At the output layer, DenseNet suggests that we perform average pooling before the Dense() and softmax classifier. If data augmentation is not used, a dropout layer must follow the Dense block Conv2D:

Figure 2.4.2: A layer in a Dense block of DenseNet, with and without the bottleneck layer BN-ReLU-Conv2D(1). We'll include the kernel size as an argument of Conv2D for clarity.
As the network gets deeper, two new problems occur. Firstly, since every layer contributes k feature maps, the number of inputs at layer l is (l − 1) × k + k_0. The feature maps can therefore grow rapidly in the deep layers, slowing down the computation. For example, for a 101-layer network this will be (101 − 1) × 12 + 24 = 1200 + 24 = 1224 for k = 12 and k_0 = 24.
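A quick check of this arithmetic in Python (the loop below is purely illustrative):

k, k0 = 12, 24
for l in (2, 4, 101):
    # inputs to layer l: (l - 1) layers each contributing k maps, plus the k0 input maps
    print(l, (l - 1) * k + k0)
# prints 2 36, 4 60, and 101 1224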
Secondly, similar to ResNet, as the network gets deeper the feature map size is reduced to increase the coverage of the kernel. If DenseNet uses concatenation in the merge operation, it must reconcile the differences in size.
To prevent the number of feature maps from growing to the point of being computationally inefficient, DenseNet introduced the Bottleneck layer as shown in Figure 2.4.2. The idea is that, after every concatenation, a 1 × 1 convolution with the number of filters equal to 4k is applied. This dimensionality reduction prevents the number of feature maps to be processed by Conv2D(3) from rapidly increasing.
The Bottleneck layer thus modifies the DenseNet layer to BN-ReLU-Conv2D(1)-BN-ReLU-Conv2D(3), instead of just BN-ReLU-Conv2D(3). We've included the kernel size as an argument of Conv2D for clarity. With the Bottleneck layer, every Conv2D(3) processes just 4k feature maps instead of (l − 1) × k + k_0 for layer l. For example, for the 101-layer network, the input to the last Conv2D(3) is still 4 × 12 = 48 feature maps for k = 12, instead of the 1,224 computed previously.
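Expressed as a code sketch (assuming the Keras functional API; the helper name bottleneck_layer is ours, not from the original listing), a bottleneck version of H(x) looks like this:

from keras.layers import BatchNormalization, Activation, Conv2D, concatenate

def bottleneck_layer(x, k):
    """BN-ReLU-Conv2D(1) bottleneck followed by BN-ReLU-Conv2D(3)."""
    # the 1 x 1 convolution squeezes the concatenated inputs down to 4k maps
    y = BatchNormalization()(x)
    y = Activation('relu')(y)
    y = Conv2D(4 * k, kernel_size=1, padding='same')(y)
    # the 3 x 3 convolution then only ever sees 4k maps, regardless of depth
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(k, kernel_size=3, padding='same')(y)
    # the k new maps are concatenated back onto the running stack
    return concatenate([x, y])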

Figure 2.4.3: The transition layer in between two Dense blocks
To solve the problem of feature map size mismatch, DenseNet divides a deep network into multiple dense blocks that are joined together by transition layers, as shown in the preceding figure. Within each dense block, the feature map size (that is, the width and height) remains constant.
The role of the transition layer is to transition from one feature map size to a smaller feature map size between two dense blocks. The reduction in size is usually by half. This is accomplished by an average pooling layer. For example, an AveragePooling2D
with the default pool_size=2
reduces the size from (64, 64, 256) to (32, 32, 256). The input to the transition layer is the output of the last concatenation layer in the previous dense block.
However, before the feature maps are passed to average pooling, their number is reduced by a certain compression factor, θ, with 0 < θ ≤ 1, using Conv2D(1). DenseNet uses θ = 0.5 in their experiments. For example, if the output of the last concatenation of the previous dense block is (64, 64, 512), then after Conv2D(1) the new dimensions of the feature maps will be (64, 64, 256). When compression and dimensionality reduction are put together, the transition layer is made of BN-Conv2D(1)-AveragePooling2D
layers. In practice, batch normalization precedes the convolutional layer.
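Putting compression and downsampling together, a transition layer can be sketched as follows. The helper name transition_layer is ours, and K.int_shape is used here only as one way to read the current number of feature maps:

from keras import backend as K
from keras.layers import BatchNormalization, Conv2D, AveragePooling2D

def transition_layer(x, compression_factor=0.5):
    """BN-Conv2D(1)-AveragePooling2D: compress the maps, then halve width/height."""
    # compress the number of feature maps by the factor theta
    num_maps = int(K.int_shape(x)[-1] * compression_factor)
    y = BatchNormalization()(x)
    y = Conv2D(num_maps, kernel_size=1, padding='same')(y)
    # the default pool_size=2 halves the feature map size, e.g. (64, 64) -> (32, 32)
    return AveragePooling2D()(y)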
Building a 100-layer DenseNet-BC for CIFAR10
We're now going to build a DenseNet-BC (Bottleneck-Compression) with 100 layers for the CIFAR10 dataset, using the design principles that we discussed above.
The following table shows the model configuration, while Figure 2.4.4 shows the model architecture. Listing 2.4.1 shows the partial Keras implementation of DenseNet-BC with 100 layers. Note that we use RMSprop, since it converges better than SGD or Adam when training DenseNet.

Table 2.4.1: DenseNet-BC with 100 layers for CIFAR10 classification

Figure 2.4.4: Model architecture of DenseNet-BC with 100 layers for CIFAR10 classification
Listing 2.4.1, densenet-cifar10-2.4.1.py
: Partial Keras implementation of DenseNet-BC with 100 layers as shown in Table 2.4.1:
# imports assumed for the layers and optimizer used below (standalone Keras API)
from keras.layers import Input, BatchNormalization, Activation, Conv2D
from keras.layers import Dropout, AveragePooling2D, Flatten, Dense
from keras.layers import concatenate
from keras.models import Model
from keras.optimizers import RMSprop

# start model definition
# densenet CNNs (composite function) are made of BN-ReLU-Conv2D
inputs = Input(shape=input_shape)
x = BatchNormalization()(inputs)
x = Activation('relu')(x)
x = Conv2D(num_filters_bef_dense_block,
           kernel_size=3,
           padding='same',
           kernel_initializer='he_normal')(x)
x = concatenate([inputs, x])

# stack of dense blocks bridged by transition layers
for i in range(num_dense_blocks):
    # a dense block is a stack of bottleneck layers
    for j in range(num_bottleneck_layers):
        y = BatchNormalization()(x)
        y = Activation('relu')(y)
        y = Conv2D(4 * growth_rate,
                   kernel_size=1,
                   padding='same',
                   kernel_initializer='he_normal')(y)
        if not data_augmentation:
            y = Dropout(0.2)(y)
        y = BatchNormalization()(y)
        y = Activation('relu')(y)
        y = Conv2D(growth_rate,
                   kernel_size=3,
                   padding='same',
                   kernel_initializer='he_normal')(y)
        if not data_augmentation:
            y = Dropout(0.2)(y)
        x = concatenate([x, y])

    # no transition layer after the last dense block
    if i == num_dense_blocks - 1:
        continue

    # transition layer compresses num of feature maps and
    # reduces the size by 2
    num_filters_bef_dense_block += num_bottleneck_layers * growth_rate
    num_filters_bef_dense_block = int(num_filters_bef_dense_block * compression_factor)
    y = BatchNormalization()(x)
    y = Conv2D(num_filters_bef_dense_block,
               kernel_size=1,
               padding='same',
               kernel_initializer='he_normal')(y)
    if not data_augmentation:
        y = Dropout(0.2)(y)
    x = AveragePooling2D()(y)

# add classifier on top
# after average pooling, size of feature map is 1 x 1
x = AveragePooling2D(pool_size=8)(x)
y = Flatten()(x)
outputs = Dense(num_classes,
                kernel_initializer='he_normal',
                activation='softmax')(y)

# instantiate and compile model
# orig paper uses SGD but RMSprop works better for DenseNet
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(1e-3),
              metrics=['accuracy'])
model.summary()
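Listing 2.4.1 assumes that the hyperparameters have already been defined earlier in the file. One plausible setup, consistent with a 100-layer DenseNet-BC (3 dense blocks of 16 bottleneck layers each, growth rate 12, and compression factor 0.5), is sketched below; the variable names match those used in the listing, but the derivation is our reading of the architecture:

# DenseNet-BC hyperparameters (a sketch consistent with the 100-layer configuration)
num_classes = 10
growth_rate = 12
depth = 100
num_dense_blocks = 3
compression_factor = 0.5
data_augmentation = True
input_shape = (32, 32, 3)   # CIFAR10 images

# each bottleneck layer contributes 2 Conv2D layers; the remaining layers are
# 1 initial Conv2D + 2 transition Conv2Ds + 1 Dense classifier:
# 1 + 3 * 16 * 2 + 2 + 1 = 100
num_bottleneck_layers = (depth - 4) // (2 * num_dense_blocks)   # = 16

# the first Dense block is preceded by a Conv2D with twice the growth rate
num_filters_bef_dense_block = 2 * growth_rate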
Training the Keras implementation in Listing 2.4.1 for 200 epochs achieves a 93.74% accuracy, vs. the 95.49% reported in the paper. Data augmentation is used. We use the same callback functions for DenseNet as in ResNet v1/v2.
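Those callbacks are not repeated here; the sketch below shows the kind of checkpoint and learning rate schedule callbacks they refer to, with an illustrative lr_schedule and file path rather than the exact ones from the ResNet code:

from keras.callbacks import ModelCheckpoint, LearningRateScheduler, ReduceLROnPlateau

def lr_schedule(epoch):
    # illustrative step decay: start at 1e-3 and reduce during the later epochs
    lr = 1e-3
    if epoch > 180:
        lr *= 0.5e-3
    elif epoch > 160:
        lr *= 1e-3
    elif epoch > 120:
        lr *= 1e-2
    elif epoch > 80:
        lr *= 1e-1
    return lr

callbacks = [
    # 'val_acc' in older Keras versions; 'val_accuracy' in newer ones
    ModelCheckpoint(filepath='densenet_cifar10.h5',
                    monitor='val_acc',
                    save_best_only=True),
    LearningRateScheduler(lr_schedule),
    ReduceLROnPlateau(factor=0.5 ** 0.5, cooldown=0, patience=5, min_lr=0.5e-6),
]
# the callbacks list is then passed to model.fit(..., callbacks=callbacks)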
For the deeper networks, the growth_rate
and depth
variables must be changed following the table in the Python code. However, it will take a substantial amount of time to train the network at a depth of 250 or 190 as done in the paper. To give us an idea of the training time, each epoch runs for about an hour on a 1060Ti GPU. Though there is also an implementation of DenseNet in the Keras applications module, it was trained on ImageNet.
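For example, switching to one of the deeper DenseNet-BC configurations reported in the paper only requires changing those two variables (the num_bottleneck_layers formula is the same one sketched after Listing 2.4.1):

# deeper DenseNet-BC configurations from the paper (much slower to train)
growth_rate, depth = 24, 250     # DenseNet-BC (k = 24, depth = 250)
# growth_rate, depth = 40, 190   # DenseNet-BC (k = 40, depth = 190)
num_bottleneck_layers = (depth - 4) // (2 * num_dense_blocks)
num_filters_bef_dense_block = 2 * growth_rate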