Semantic Segmentation with FCN Autoencoders
Image segmentation is the task of partitioning an image into meaningful regions, i.e. separating foreground from background. The given problem statement deals with segmenting cell nuclei from histology images. Semantic segmentation has been performed on the provided dataset with an FCN (Fully Convolutional Network) autoencoder model, implemented in Python with the Keras library. The encoder and decoder models are defined separately, using the Functional API and the Sequential API respectively, so that the architecture can be experimented with further. A data generator has been used to keep the computation efficient. The training images have been augmented with rotation and flipping operations, increasing the number of training samples from 590 to 1770 and preventing the network from overfitting the dataset. Mean-based normalization has been applied as pre-processing. The predicted masks are finally thresholded using Otsu's method. The Dice coefficient has been employed to evaluate training, and MSE (mean squared error) has been used as the loss function, optimized with Adam to update the weights through backpropagation.
The given dataset has 590 training samples, which have been augmented to 1770 samples: the originals, 590 flipped images, and 590 images rotated by 90 degrees. The training images have been resized to 320 x 320 and converted to grayscale. Mean-based normalization has been performed on X (the training samples) to help the network converge faster.
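A minimal sketch of these augmentation and normalization steps, assuming the training images and masks are already loaded into NumPy arrays X and Y of shape (590, 320, 320, 1); the exact normalization (mean subtraction with standard-deviation scaling) is an assumption, since the text only specifies that it is mean-based.

```python
import numpy as np

def augment(X, Y):
    # Originals, horizontally flipped copies, and 90-degree rotated copies: 590 -> 1770 samples.
    X_aug = np.concatenate([X, np.flip(X, axis=2), np.rot90(X, k=1, axes=(1, 2))])
    Y_aug = np.concatenate([Y, np.flip(Y, axis=2), np.rot90(Y, k=1, axes=(1, 2))])
    return X_aug, Y_aug

def normalize(X):
    # Mean-based normalization: centre on the dataset mean (scaling by the std is assumed).
    return (X - X.mean()) / X.std()
```

Note that the same flip and rotation are applied to the masks so that each image-mask pair stays aligned.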
The proposed FCN-based autoencoder consists of two sub-models, an encoder and a decoder. The encoder consists of four weight layers, each convolutional with 3 x 3 filters. Between the convolution layers, a max pooling operation with a 2 x 2 kernel is employed. The decoder has four weight layers similar to the encoder, each convolutional, with kernel dimensions identical to the encoder's, in an attempt to reconstruct the input. In place of the encoder's max pooling layers, the decoder uses upsampling layers with 2 x 2 kernels. To add non-linearity, ReLU activation is used in the encoder and LeakyReLU in the decoder; these choices prevent the back-propagated gradients from vanishing or exploding, a classic hurdle often faced with sigmoid activations. The Dice coefficient has been used to evaluate training performance. Given two sets X and Y, this coefficient measures the similarity between them; for semantic segmentation it indicates whether the model is learning a meaningful relationship between the input image and the corresponding mask, and the higher the Dice coefficient, the better. If the two sets are identical (i.e. they contain the same elements) the coefficient equals 1.0, if X and Y have no elements in common it equals 0.0, and otherwise it lies somewhere in between.
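A sketch of the described architecture in Keras, with the encoder built through the Functional API and the decoder as a Sequential model, as stated above; the number of filters per layer (32 here) and the single-channel output of the last decoder layer are assumptions, since the text does not specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_encoder(input_shape=(320, 320, 1), filters=32):
    # Functional API: four 3x3 conv layers (ReLU) with 2x2 max pooling in between.
    inp = layers.Input(shape=input_shape)
    x = inp
    for _ in range(4):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu",
                          kernel_initializer="glorot_uniform")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    return keras.Model(inp, x, name="encoder")

def build_decoder(filters=32):
    # Sequential API: four 3x3 conv layers (LeakyReLU) with 2x2 upsampling before each.
    decoder = keras.Sequential(name="decoder")
    for i in range(4):
        decoder.add(layers.UpSampling2D((2, 2)))
        out_channels = 1 if i == 3 else filters  # last layer reconstructs a one-channel mask (assumed)
        decoder.add(layers.Conv2D(out_channels, (3, 3), padding="same",
                                  kernel_initializer="glorot_uniform"))
        decoder.add(layers.LeakyReLU())
    return decoder

encoder = build_encoder()
decoder = build_decoder()
autoencoder = keras.Model(encoder.input, decoder(encoder.output), name="fcn_autoencoder")
```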
The weights of the encoder and decoder models are initialized with the Keras built-in Glorot uniform initializer, which takes the number of input and output units of the weight tensor into consideration. The padding has been set to 'same', which keeps the output feature map the same size as the input feature map, so downsampling is carried out only by the max pooling layers: if the pooling kernel is k x k, a feature map of size M x N reduces to M/k x N/k (e.g. 2 x 2 pooling reduces a 320 x 320 map to 160 x 160). This makes it easy to tune hyperparameters such as the image size and the convolution kernel dimensions. The optimizer used is Adam, a gradient descent optimization that uses the first and second moments of the gradients in its update. The weights are updated every eight training samples (batch size 8), and the total number of epochs was chosen to be 150, as the Dice coefficient and loss of the autoencoder stopped improving after 150 epochs.
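A sketch of the training configuration just described (Adam optimizer, MSE loss, Dice coefficient as the metric, batch size 8, 150 epochs); the smoothing term in the Dice metric and the X_train/Y_train array names are assumptions for illustration.

```python
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Dice = 2|X ∩ Y| / (|X| + |Y|); the smooth term (assumed) avoids division by zero.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

autoencoder.compile(optimizer="adam", loss="mse", metrics=[dice_coefficient])

# Weights are updated every eight samples; training runs for 150 epochs.
autoencoder.fit(X_train, Y_train, batch_size=8, epochs=150)
```

In the actual pipeline the batches would come from the data generator mentioned earlier rather than from in-memory arrays, but the compile and fit arguments stay the same.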
The non-linear activations for the encoder and decoder were experimentally selected to be ReLU and LeakyReLU respectively. Choosing LeakyReLU for the decoder avoids the dying ReLU problem, since it has no zero-slope section; for the encoder, using ReLU or LeakyReLU made little difference to the training performance metric. The convolution kernels are kept small (3 x 3) with a stride of one, so that fine-grained information is extracted for use in later layers and more complex features can be learned, in contrast to larger filter sizes which tend to capture more generic features. As a trade-off between kernel size and the number of filters per convolution layer, smaller kernels with a higher number of filters have been used. The zero-slope point is illustrated in the short sketch below.
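A small NumPy illustration of the zero-slope point: ReLU outputs exactly zero for all negative inputs (so the gradient there is zero and units can die), while LeakyReLU keeps a small negative slope; the slope of 0.3 matches the Keras default and is used here purely for illustration.

```python
import numpy as np

def relu(x):
    # Zero for all negative inputs: zero gradient on that side.
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.3):
    # Negative inputs keep a small non-zero slope, so gradients do not die.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.   0.   0.   0.5  2. ]
print(leaky_relu(x))  # [-0.6  -0.15  0.    0.5   2.  ]
```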