Andrew Pinkham
Henry Jones
Colorado College
4/20/22
Image Classification and Semantic Segmentation using Convolutional Neural Networks
and U-Net
Abstract:
In 2022, computer vision underlies almost every piece of technology that surrounds us. Within this broad domain of modern machine learning research, tasks such as object detection, ‘deep fakes’, and autonomous driving can be reduced to image processing and, more specifically, image segmentation. This paper explores how Convolutional Neural Networks can be applied to the subproblem of semantic image segmentation. After reviewing the theory of neural networks, Convolutional Neural Networks, and a specialized architecture called U-Net, an application is presented on a Kaggle.com dataset of aerial satellite images of Dubai. Approaches and results are discussed to assess model accuracy, the possibility of overfitting, and future modifications.
Introduction
The human mind and eye are remarkably adept at recognizing abstract, low-level features
in images and contextually processing them to assign vague shapes and patterns to object
categories we already know. Over the last few decades, computer scientists have developed
many generations of statistical learning methods to bring about the future of computer
vision, where image classification and semantic segmentation are at the forefront of machine
learning. From automated driving to biomedical image diagnostics to facial recognition
software and the infamous ‘deep fake’, the ability of computers to reduce thousands of pixels
to their semantic constituents presents unprecedented opportunities for automating the
collection and analysis of information hidden within images.
Computer vision and image classification also have the potential to revolutionize
environmental research ranging from land and polar ice modeling to forest and vegetation
dynamics to agriculture. Given our interests in computer vision and environmental geospatial
data, this paper was brought about by our exploration of techniques for classifying and mapping
satellite imagery to various semantic labels. Thus our main goal was to investigate the potential
power of statistical learning on different modes of image segmentation, classification, and
information processing.
Investigation of several different models eventually led us to explore in greater depth the
power of Convolutional Neural Networks or CNNs. CNNs have been shown to outperform other
models when tasked with image-based computer vision because of their convolutional context
learning scheme. As our research pushed us towards CNNs, we decided to focus on the more
concrete goals of first understanding and implementing a simple CNN in R. After this was
accomplished, we committed to implementing a special type of CNN called a U-Net model to
both predict image segmentation for a labeled dataset of satellite imagery of Dubai and assess the
model accuracy.
Background Information
In embarking on this paper, we conducted research into the predominant image
classification techniques accessible through our text “An Introduction to Statistical Learning”
and various influential papers in the field of satellite image classification (Supervised
Classification of Radarsat-2; Kakarla; Garg et al.).
In our research we encountered several techniques already seen in class, such as random
forests and K-nearest neighbors adapted to image classification, but the predominant papers
emphasized the superiority of support vector machines (SVMs) and Convolutional Neural
Networks (CNNs) in these visual tasks. Further research indicated CNNs were the best choice,
particularly when implemented with the so-called “U-Net” architecture first introduced in 2015
(Ronneberger et al.). In order to build CNN and U-Net architectures, we now provide a brief
introduction to the theory of single- and multilayer neural networks and then CNNs.
Single Layer Neural Networks
Neural networks take an input vector of p features and build a
nonlinear function f(X) to predict the response Y. The p features of the observations make up the
units in the input layer, and all of these features from the input layer are then passed into each of
the K hidden units (K is arbitrary). The schematic of such a model is seen below in Figure 1.
Figure 1: Schematic of a single-layer neural network.
The model is built in two steps. First, the K activations A_k, for k = 1, ..., K, in the hidden layer are
computed as functions of the input features such that

$$A_k = h_k(X) = g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right),$$

where the parameters w_{k0}, w_{k1}, ..., w_{kp} must all be estimated from data and g(z) is a nonlinear
activation function that is specified in advance. The preferred choice of activation function is
typically the ReLU (rectified linear unit) activation function, which takes the form

$$g(z) = (z)_+ = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise.} \end{cases}$$

The ReLU function is ideal due to its storage and computational efficiency. Thus our model
essentially computes K different linear combinations of X and then passes these linear
combinations to the nonlinear activation function, which returns the activations. Thus each
activation A_k is essentially a different transformation of the original p features. We then compute the
output function f(X) by taking a linear combination of the activations to yield

$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k A_k,$$

where the intercept and coefficients \beta_0, \beta_1, ..., \beta_K are all estimated from data. This output function is
a linear regression model in the K activations as opposed to the original predictors themselves.
Fitting a neural network requires estimating the unknown parameters w_{kj} and \beta_k, which as usual is
accomplished through an optimization problem. For a quantitative response, typically squared-
error loss is used, so that the parameters are chosen to minimize

$$\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2,$$

or the cross-entropy quantity to be discussed next in multilayer neural networks.
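To make these computations concrete, the following minimal NumPy sketch evaluates a single-layer network's output for one observation; the variable names and dimensions (p = 3 inputs, K = 4 hidden units) are purely illustrative and not part of any model used later in this paper.

```python
import numpy as np

def relu(z):
    # ReLU activation: g(z) = max(0, z)
    return np.maximum(0.0, z)

def single_layer_forward(x, W, w0, beta, beta0):
    # x: input vector of length p
    # W: K x p matrix of hidden-layer weights w_kj
    # w0: length-K vector of hidden-layer intercepts w_k0
    # beta, beta0: output-layer coefficients
    A = relu(w0 + W @ x)      # K activations A_k = g(w_k0 + sum_j w_kj x_j)
    return beta0 + beta @ A   # f(X) = beta_0 + sum_k beta_k A_k

# Tiny example with p = 3 inputs and K = 4 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W, w0 = rng.normal(size=(4, 3)), rng.normal(size=4)
beta, beta0 = rng.normal(size=4), 0.5
print(single_layer_forward(x, W, w0, beta, beta0))
```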
Multilayer Neural Networks
Multilayer neural networks use much of the same framework except they make use of
multiple hidden layers. The first hidden layer is computed as discussed above after choosing K_1, and
then the second hidden layer with K_2 units is computed by using the activations in the first
hidden layer as inputs, computing

$$A_l^{(2)} = h_l^{(2)}(X) = g\left(w_{l0}^{(2)} + \sum_{k=1}^{K_1} w_{lk}^{(2)} A_k^{(1)}\right)$$

for l = 1, ..., K_2. By this process, all activations are functions of the original X observations, and
we are able to build complex transformations of X. Eventually we arrive at the output layer,
where the function (or functions) predict the response Y similarly to the single-layer output
function above, but where f(X) is a linear combination of the activations in the last hidden layer.
In the case of a classification or qualitative response, we can have as many output functions as
there are classes, and thus
our work is not yet done. In this case, our estimates must represent class probabilities,
and we do this using the special softmax activation function

$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{l=1}^{L} e^{Z_l}},$$

where Z_m is given by

$$Z_m = \beta_{m0} + \sum_{k=1}^{K_2} \beta_{mk} A_k^{(2)},$$

and where the A_k^{(2)} are the activations in the last hidden layer for k = 1, ..., K_2. This softmax
activation has similarities to logistic regression in that our outputs now behave like
probabilities: they are non-negative and sum to one. Class assignment is accomplished by
assigning images to the class with the highest softmax activation value. Additionally, to train our
network for a categorical response we compute coefficients that minimize categorical cross-
entropy, the negative multinomial log-likelihood given by

$$-\sum_{i=1}^{n} \sum_{m=1}^{L} y_{im} \log\left(f_m(x_i)\right),$$

where y_{im} equals one if observation i belongs to class m and zero otherwise.
For regression, multilayer networks simply minimize squared-error loss as discussed for
single-layer regression networks. It should be noted that these networks require a very large number
of coefficients, and we also must work with very large training sets so there are enough patterns for the
model to learn. To avoid overfitting, we often need to use forms of regularization such as ridge
regularization and dropout regularization which are analogous to regularization seen elsewhere
in the textbook.
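The softmax and cross-entropy calculations above are easy to verify numerically. The short sketch below, with made-up values of Z_m for L = 3 classes, shows that the softmax outputs behave like probabilities and computes the corresponding cross-entropy term for one observation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs are non-negative and sum to one
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_cross_entropy(y_onehot, probs):
    # Negative multinomial log-likelihood contribution for a single observation
    return -np.sum(y_onehot * np.log(probs))

z = np.array([2.0, 0.5, -1.0])   # Z_m for L = 3 classes (illustrative values)
probs = softmax(z)
y = np.array([1.0, 0.0, 0.0])    # the true class is the first one
print(probs, probs.sum())         # class probabilities summing to 1
print(categorical_cross_entropy(y, probs))
```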
Convolutional Neural Networks
CNNs are designed to mimic how humans classify images, by recognizing specific
features or patterns anywhere in the image that distinguish each particular object class. CNNs
first identify low-level features in the input image, such as small edges and patches of color, and
these are then combined to form higher-level features such as trees, roads, buildings, or anything
else relevant in the training images. The accumulation of context and the presence of distinguishing
higher-level features of the image eventually contribute to the probability of an image being
assigned to a particular class.
Two types of hidden layers called convolution layers and pooling layers make up CNNs.
Convolution layers search for the presence of small patterns in the image, and pooling layers
perform dimension reduction, selecting a subset of values that summarizes how strongly each
filter's pattern is present. Deep CNNs make use of many convolutional and pooling layers carefully
designed with regularization methods between layers.
A convolutional layer is composed of a large number of convolutional filters, each of
which is a template that checks for the presence of a specific local feature or pattern. In the case
of CNNs applied to two-dimensional image classification, an image is convolved when it undergoes
a convolution operation with a particular convolution filter, which is a small array of weights, as demonstrated
below.
Figure 1.2: Convolution of an image with a small filter of weights.
If a submatrix of the original image resembles the convolution filter, then the algebra of the
convolution will grant this region a large value in the convolved image and otherwise regions
will receive small values. The convolved image thus highlights subregions of the original image
which resemble the convolution filter. The filters are learned for the specific classification task in
CNNs, and the filter weights (the entries of the filter array in Figure 1.2) are the parameters going
from an input layer to a hidden layer, with one hidden unit for each pixel in the convolved image.
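As a small illustration of the convolution operation described above, the following sketch slides a hypothetical 2x2 filter over a toy image and records the elementwise product sums; submatrices that resemble the filter receive large values in the convolved image.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # "Valid" convolution: slide the filter over every submatrix of the image and
    # record the sum of elementwise products; large values mean the submatrix
    # resembles the filter. (Strictly this is cross-correlation, the operation
    # CNN layers actually compute, conventionally called convolution.)
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.array([[1, 2, 5, 3],
                  [8, 0, 1, 2],
                  [3, 7, 9, 4],
                  [2, 1, 0, 6]], dtype=float)
vertical_edge = np.array([[1, -1],
                          [1, -1]], dtype=float)   # a hypothetical 2x2 filter
print(convolve2d_valid(image, vertical_edge))
```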
In the case of RGB image analysis, a single convolution filter has three channels, one
two-dimensional array of weights for each color, with potentially different filter
weights for each channel. The three convolved images are summed to form a single two-dimensional
output feature map, and applying K_i different filters in this way yields K_i channels after the ith hidden layer.
The second type of hidden layer in a CNN is the pooling layer, which reduces the
dimensions of convolved images to provide lower-dimensional summaries to pass to subsequent
convolution or flattening layers. This is commonly done through max pooling illustrated in
Figure 1.3.
Figure 1.3: Max pooling of a feature map.
Because convolution layers return images with high values indicating feature detection, max
pooling summarizes the presence of feature detection while balancing location information and
location invariance. After some sequence of convolutions and pooling, we flatten the feature
maps to pass pixels individually to a fully connected layer and then to the softmax activation
output layer.
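The max pooling operation of Figure 1.3 can be summarized in a few lines; the sketch below condenses each non-overlapping 2x2 block of a toy feature map to its maximum value.

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Non-overlapping max pooling: keep the largest value in each size x size block,
    # summarizing whether the feature was detected anywhere in that block.
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    blocks = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 2, 5, 3],
               [3, 0, 1, 2],
               [2, 1, 3, 4],
               [1, 1, 2, 0]], dtype=float)
print(max_pool(fm))   # -> [[3. 5.], [2. 4.]]
```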
U-Net is a particularly powerful CNN architecture that, instead of outputting a
classification or a numerical variable, outputs a semantically segmented image of
equal or similar dimensions to the input image. In addition to the sequence of convolution and
pooling layers that reduces spatial dimension while building up higher-level features, U-Net
reverses this direction, upsampling back toward the input resolution and predicting a segmentation
mask based on the locations of the features that have been detected and classified. An in-depth
examination of the U-Net architecture is beyond the scope of this introduction; the original paper
(Ronneberger et al.) details the theory and approach of the model.
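For readers who want a sense of how the encoder-decoder structure fits together in code, the sketch below builds a deliberately tiny U-Net-style network in Keras; the layer counts, filter sizes, and input shape are illustrative placeholders and do not reproduce the architecture from the original paper or the model trained later in this project.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(256, 256, 3), n_classes=6):
    # Encoder: convolutions extract features, max pooling reduces resolution
    inputs = layers.Input(shape=input_shape)
    c1 = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(32, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck
    b = layers.Conv2D(64, 3, activation="relu", padding="same")(p2)

    # Decoder: upsample back to the input resolution, concatenating the matching
    # encoder feature maps (the skip connections that recover fine spatial detail)
    u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
    u2 = layers.concatenate([u2, c2])
    c3 = layers.Conv2D(32, 3, activation="relu", padding="same")(u2)
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
    u1 = layers.concatenate([u1, c1])
    c4 = layers.Conv2D(16, 3, activation="relu", padding="same")(u1)

    # Per-pixel softmax over the segmentation classes
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c4)
    return Model(inputs, outputs)

model = tiny_unet()
model.summary()
```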
Methods
After researching the pros and cons of different statistical models for the computer vision
task of semantic image segmentation, we chose to follow a basic CNN for image classification
tutorial in the second edition of “An Introduction to Statistical Learning with Applications in R.”
This tutorial was fairly straightforward and demonstrated the power of a simple CNN. After
following the installation instructions (“Installation…”), we built a CNN with four pairs of two-
dimensional convolution and pooling layers with 32, 64, 128, and 256 convolutional filters,
ReLU activation, a convolutional kernel size of 3x3, and 2x2 max pooling. We experimented
with the architecture and settled on a model with three layers and a higher dropout rate of 0.65 for
best results. For complete code, see the appendix.
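The textbook lab is written in R with the keras interface; for consistency with the Python code used in the rest of this project, the following sketch approximates the architecture described above in Keras for Python. The input shape, dense layer width, and optimizer are assumptions, not the exact settings from our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input size and class count are placeholders (the ISLR lab uses 32x32 CIFAR-100 images)
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dropout(0.65),                      # the higher dropout rate noted above
    layers.Dense(512, activation="relu"),      # assumed dense layer width
    layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```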
This guided investigation gave us more understanding and confidence to tackle a more
challenging semantic segmentation task. Through more research, we came across a
Kaggle.com competition dataset for semantic segmentation containing 66 JPG satellite
images of Dubai with aerial perspectives of roads, land, buildings, vegetation, and water.
All 66 images were accompanied by a hand-labeled PNG mask image consisting of color
segmentations into 6 classes: roads, land, buildings, vegetation, water, and unlabeled. Though
small, this dataset seemed like a great place to start in our exploration of semantic segmentation
(Humans in The Loop).
Once familiar with the data we would attempt to conduct statistical learning on, we found
a repository with Python code implementing a U-Net model in TensorFlow. Following the
framework laid out in the repository, we began implementing the TensorFlow U-Net model in
our own Python class. From there, we were ready to train the model, tweak hyperparameters, and
predict on unseen data. The one final challenge in our way was preprocessing: the images from
the dataset would need to be reduced to matrix representations that our U-Net model could
understand. This preprocessing involved many steps.
We began by reading each image from disk with the “os” package and cropping it to
dimensions divisible by the patch size. Using the NumPy and Patchify packages, we broke up each image into a set
of smaller images that we would train the model on. We then repeated this process for the PNG
mask images. The result of this step was to populate a new matrix of cleaned and “patchified”
training feature images and a new matrix of cleaned and “patchified” training label images.
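A minimal sketch of this patchifying step is shown below; the directory layout, the patch size of 256, and the simple rescaling to [0, 1] are assumptions for illustration rather than the exact preprocessing used in our repository.

```python
import os
import cv2
import numpy as np
from patchify import patchify

PATCH = 256                   # hypothetical patch size
image_dir = "data/images"     # placeholder path, not the actual repository layout

patches_out = []
for name in sorted(os.listdir(image_dir)):
    img = cv2.imread(os.path.join(image_dir, name))
    if img is None:
        continue
    # Crop so both dimensions are divisible by the patch size
    h = (img.shape[0] // PATCH) * PATCH
    w = (img.shape[1] // PATCH) * PATCH
    img = img[:h, :w, :]
    # Break the image into non-overlapping PATCH x PATCH tiles
    tiles = patchify(img, (PATCH, PATCH, 3), step=PATCH)
    patches_out.append(tiles.reshape(-1, PATCH, PATCH, 3))

X_train = np.concatenate(patches_out) / 255.0   # simple rescaling to [0, 1]
print(X_train.shape)
```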
From there, we were able to make a new instance of our U-Net model, pass it
hyperparameters such as the number of epochs, batch size, and validation criteria, and then run the fit
function. For each epoch, our model would print the time it took to run, training loss, training
accuracy, Jaccard coefficient, validation loss, validation accuracy, and validation Jaccard
coefficient, to give us a sense of how the model was performing as it received more training.
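A sketch of this step is shown below, assuming the model and the patchified training arrays from the previous sketches; the Jaccard (IoU) metric is written as a custom Keras metric, and the batch size and validation split shown here are illustrative rather than our exact settings.

```python
from tensorflow.keras import backend as K

def jaccard_coef(y_true, y_pred, smooth=1.0):
    # Intersection over union computed on the flattened one-hot masks
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    union = K.sum(y_true_f) + K.sum(y_pred_f) - intersection
    return (intersection + smooth) / (union + smooth)

# `model`, `X_train`, and `y_train` (one-hot encoded masks) are assumed to come
# from the earlier sketches; batch size and validation split are illustrative.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", jaccard_coef])
history = model.fit(X_train, y_train,
                    batch_size=16,
                    epochs=10,
                    validation_split=0.2)
```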
Each epoch on a 2.7 GHz quad-core CPU took roughly 329 seconds. Performing the training on a GPU
would vastly speed up the training time.
After the model finished training, we used TensorFlow's built-in save function to store a
file representation of the model on our machine so that we could easily load it for future
predictions without having to retrain the model. Lastly, our script would display graphs of
the training and validation loss over time as the model trained, and some example predictions on
some of the training data.
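Saving and reloading in Keras can be sketched as follows; the filename is a placeholder, and the custom Jaccard metric from the previous sketch must be passed back in when the model is reloaded.

```python
import tensorflow as tf

# Save the trained model so later predictions do not require retraining
model.save("unet_dubai.h5")   # hypothetical filename

# ...later, reload it; custom metrics such as jaccard_coef must be supplied again
reloaded = tf.keras.models.load_model(
    "unet_dubai.h5",
    custom_objects={"jaccard_coef": jaccard_coef})
```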
To encapsulate the project, we developed a simple Python GUI application with Tkinter
which allows users to take a custom screenshot of their screen and then pipe the
image directly into our model. After roughly 30 seconds of processing, the application displays the prediction to
the user and offers to save the images to disk.
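A stripped-down sketch of the screenshot-to-prediction flow is shown below, assuming the reloaded model from the previous sketch; the real application adds custom region selection, patch smoothing, and saving, so this is only an outline of the idea.

```python
import numpy as np
import tkinter as tk
from PIL import ImageGrab   # screenshot capture (Windows and macOS)

def capture_and_predict(model, size=256):
    # Grab the screen, resize to the model's input size, and predict a mask
    shot = ImageGrab.grab().resize((size, size))
    x = np.asarray(shot)[..., :3] / 255.0         # drop any alpha channel, rescale
    probs = model.predict(x[np.newaxis, ...])[0]  # per-pixel class probabilities
    return np.argmax(probs, axis=-1)              # predicted class for each pixel

# `reloaded` is the model loaded in the previous sketch
root = tk.Tk()
tk.Button(root, text="Screenshot and segment",
          command=lambda: print(capture_and_predict(reloaded))).pack()
root.mainloop()
```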
Results and Data Analysis
We trained four main U-Net models during the course of our investigation. Each model
shared the same hyperparameters except for the number of epochs: one model ran for 2 epochs, another for
5, another for 10, and another for 100.

Our 100-epoch model outperformed all others, with training IoU approaching 0.9 and
validation IoU exceeding 0.7. The 10-epoch model had both training and validation IoU
approaching 0.6. The 5-epoch model had both training and validation IoU approaching 0.55. We
did not capture IoU for our 2-epoch model. See the figures below.
We were not able to capture the loss for our 100-epoch model or our 2-epoch model, though we
did collect loss graphs for the other two models. The 10-epoch model had both training and
validation loss approaching 0.91. The 5-epoch model had both training and validation loss
approaching 0.93. See the figures below.
Discussion
The U-Net model is certainly an interesting technology, but our implementation is far from perfect.
The challenge behind a project like this is to spend time and thought dialing in hyperparameters
and tweaking the model to produce better results. We concluded that even though our metrics for
model accuracy were increasing with more training epochs, our model was perhaps overfitting.
We came to this conclusion by visually inspecting our model's performance on unseen
data that did not come from the original dataset. Our hope was to develop a robust, generic aerial
image classifier with the ability to semantically segment any type of aerial image. Our model
performs very well on the specific Dubai image dataset, but the 100-epoch model performs
poorly on other satellite imagery.
After thinking about this problem, we chose to implement our 10-epoch model for our
application because our graphs of IoU and loss displayed clear diminishing returns around the 10-
epoch mark. With many more than 10 epochs, we expect to see the beginnings of overfitting.
The GUI application is in its early phases but provides a quick and easy way to demonstrate our
model in a somewhat real-world fashion. Unfortunately, it does take some time to
render a prediction once the user has taken a screenshot; the delay is caused by the run time of a
smoothing algorithm that aims to produce a better visual prediction and is computationally expensive.
Improving on our work so far would require more hours dedicated to tweaking the hyperparameters
of the U-Net model. Additionally, since the model is trained only on the Dubai dataset, its
predictive power on unseen data unlike the training set is not remarkable. Training the model on additional data
would certainly improve this power. However, hand-labeling image segmentations is a challenge
on its own. Perhaps developing a hybrid CNN model that can be trained in both a supervised and an
unsupervised manner would be an interesting future project. At the very least, we hope the GUI
application alongside the git repository will inspire others to learn more about and tackle difficult
computer vision tasks.
Works Cited
“73 - Image Segmentation Using U-Net - Part 1.” YouTube,
https://www.youtube.com/watch?v=azM57JuQpQI.
Bnsreenu. “python_for_microscopists.” GitHub,
https://github.com/bnsreenu/python_for_microscopists.
“Fully Convolutional Networks for Semantic Segmentation.” IEEE Xplore,
https://ieeexplore.ieee.org/document/7478072.
Garg, Rajat, et al. “Semantic Segmentation of POLSAR Image Data Using Advanced Deep
Learning Model.” Nature News, Nature Publishing Group, 28 July 2021,
https://www.nature.com/articles/s41598-021-94422-y#Tab1.
“An Introduction to Statistical Learning.” An Introduction to Statistical Learning,
https://www.statlearning.com/.
“Installation Guide: Python, Reticulate, and Keras” https://hastie.su.domains/ISLR2/keras-
instructions.html
“How to Train an SVM Classifier on a Satellite Image Using Python.” Stack Overflow,
https://stackoverflow.com/questions/43331510/how-to-train-an-svm-classifier-on-a-satellite-
image-using-python.
Kakarla, Syam. “Hyperspectral Image Analysis - Getting Started.” Medium, Towards Data
Science, 7 Apr. 2021, https://towardsdatascience.com/hyperspectral-image-analysis-getting-
started-74758c12f2e9.
Kakarla, Syam. “Land Cover Classification of Satellite Imagery Using Convolutional Neural
Networks.” Medium, Towards Data Science, 3 Jan. 2021, https://towardsdatascience.com/land-
cover-classification-of-satellite-imagery-using-convolutional-neural-networks-91b5bb7fe808.
Humans in the Loop. “Semantic Segmentation of Aerial Imagery.” Kaggle, 29 May 2020,
https://www.kaggle.com/datasets/humansintheloop/semantic-segmentation-of-aerial-imagery.
Supervised Classification of Radarsat-2 ... - Arxiv. https://arxiv.org/pdf/1608.00501.pdf.
Ronneberger, Olaf, et al. “U-Net: Convolutional Networks for Biomedical Image Segmentation.”
https://www.semanticscholar.org/paper/U-Net%3A-Convolutional-Networks-for-Biomedical-
Image-Ronneberger-Fischer/6364fdaa0a0eccd823a779fcdd489173f938e91a.
Vooban. “Vooban/Smoothly-Blend-Image-Patches: Using a U-Net for Image Segmentation,
Blending Predicted Patches Smoothly Is a Must to Please the Human Eye.” GitHub,
https://github.com/Vooban/Smoothly-Blend-Image-Patches.
Appendix
See following pages for code screenshots.
See https://github.com/andrewcolepinkham/semanticsegmentation for full code repository.