Andrew Pinkham
Henry Jones
Colorado College
4/20/22
Image Classification and Semantic Segmentation using Convolutional Neural Networks
and U-Net
Abstract:
In 2022, computer vision underlies almost every piece of technology that surrounds us. Within this broad domain of modern machine learning research, tasks such as object detection, ‘deep fakes’, and autonomous driving can be reduced to image processing and, more specifically, image segmentation. This paper explores how Convolutional Neural Networks can be applied to the subproblem of semantic image segmentation. After reviewing the theory of neural networks, Convolutional Neural Networks, and a specialized architecture called U-Net, an application is presented on a Kaggle.com dataset of aerial satellite images of Dubai. Approaches and results are discussed to assess model accuracy, the possibility of overfitting, and future modifications.
Introduction
The human mind and eye are remarkably adept at recognizing abstract, low-level features
in images and contextually processing them to assign vague shapes and patterns to object
categories we already know. Over the last few decades, computer scientists have developed
many generations of statistical learning methods to bring about the future of computer
vision, where image classification and semantic segmentation are at the forefront of machine
learning. From automated driving to biomedical image diagnostics to facial recognition
software and the infamous ‘deep fake’, the ability of computers to reduce thousands of pixels
to their semantic constituents presents unprecedented opportunities for automating the
collection and analysis of information hidden within images.
Computer vision and image classification also have the potential to revolutionize
environmental research ranging from land and polar ice modeling to forest and vegetation
dynamics to agriculture. Given our interests in computer vision and environmental geospatial
data, this paper was brought about by our exploration of techniques for classifying and mapping
satellite imagery to various semantic labels. Thus our main goal was to investigate the potential
power of statistical learning on different modes of image segmentation, classification, and
information processing.
Investigation of several different models eventually led us to explore in greater depth the
power of Convolutional Neural Networks or CNNs. CNNs have been shown to outperform other
models when tasked with image-based computer vision because of their convolutional context
learning scheme. As our research pushed us towards CNNs, we decided to focus on the more
concrete goals of first understanding and implementing a simple CNN in R. After this was
accomplished, we committed to implementing a special type of CNN called a U-Net model to
both predict image segmentation for a labeled dataset of satellite imagery of Dubai and assess the
model accuracy.
Background Information
In embarking on this paper, we conducted research into the predominant image
classification techniques accessible through our text “An Introduction to Statistical Learning”
and various influential papers in the field of satellite image classification (Supervised
Classification of Radarsat-2; Kakarla; Garg et al.).
In our research we encountered several techniques already seen in class, such as random
forests and K-nearest neighbors adapted to image classification, but the predominant papers
emphasized the superiority of support vector machines (SVMs) and Convolutional Neural
Networks (CNNs) in these visual tasks. Further research indicated CNNs were the best choice,
particularly when implemented with the so-called “U-Net” architecture first introduced in 2015
(Ronneberger et al.). In order to build CNN and U-Net architectures, we now provide a brief
introduction to the theory of single- and multilayer neural networks and then CNNs.
Single Layer Neural Networks
Neural networks take an input vector of p features and build a
nonlinear function f(X) to predict the response Y. The p features of the observations make up the
units in the input layer, and all of these features from the input layer are then passed into each of
the K hidden units (K is arbitrary). The schematic of such a model is seen below in Figure 1.
Figure 1: Schematic of a single-layer neural network.
The model is built in two steps. First, the K activations A_k, for k = 1, ..., K, in the hidden layer are
computed as functions of the input features such that

$$A_k = h_k(X) = g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\right),$$

where the parameters w_{k0}, w_{k1}, ..., w_{kp} must all be estimated from data and g(z) is a nonlinear
activation function that is specified in advance. The preferred choice of activation function is
typically the ReLU (rectified linear unit) activation function, which takes the form

$$g(z) = (z)_+ = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise.} \end{cases}$$

The ReLU function is ideal due to its storage and computational efficiency. Thus our model
essentially computes K different linear combinations of X and then passes these linear
combinations to the nonlinear activation function, which returns the activations. Thus each
activation A_k is essentially a different transformation of the original p features. We then compute the
output function f(X) by taking a linear combination of the activations to yield

$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k A_k,$$

where the intercept and coefficients \beta_0, \beta_1, ..., \beta_K are all estimated from data. This output function is
a linear regression model in the K activations as opposed to the original predictors themselves.
Fitting a neural network requires estimating the unknown parameters w_{kj} and \beta_k, which as usual is
accomplished through an optimization problem. For a quantitative response, typically squared-
error loss is used, so that the parameters are chosen to minimize

$$\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2,$$

or the cross-entropy quantity to be discussed next in multilayer neural networks.
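To make these computations concrete, the following minimal NumPy sketch evaluates a single-layer network's output for one observation; the variable names and dimensions (p = 3 inputs, K = 4 hidden units) are purely illustrative and not part of any model used later in this paper.

```python
import numpy as np

def relu(z):
    # ReLU activation: g(z) = max(0, z)
    return np.maximum(0.0, z)

def single_layer_forward(x, W, w0, beta, beta0):
    # x: input vector of length p
    # W: K x p matrix of hidden-layer weights w_kj
    # w0: length-K vector of hidden-layer intercepts w_k0
    # beta, beta0: output-layer coefficients
    A = relu(w0 + W @ x)      # K activations A_k = g(w_k0 + sum_j w_kj x_j)
    return beta0 + beta @ A   # f(X) = beta_0 + sum_k beta_k A_k

# Tiny example with p = 3 inputs and K = 4 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W, w0 = rng.normal(size=(4, 3)), rng.normal(size=4)
beta, beta0 = rng.normal(size=4), 0.5
print(single_layer_forward(x, W, w0, beta, beta0))
```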
Multilayer Neural Networks
Multilayer neural networks use much of the same framework except they make use of
multiple hidden layers. The first hidden layer is computed as discussed above after choosing K_1, and
then the second hidden layer with K_2 units is computed by using the activations in the first
hidden layer as inputs, computing

$$A_l^{(2)} = h_l^{(2)}(X) = g\left(w_{l0}^{(2)} + \sum_{k=1}^{K_1} w_{lk}^{(2)} A_k^{(1)}\right)$$

for l = 1, ..., K_2. By this process, all activations are functions of the original X observations, and
we are able to build complex transformations of X. Eventually we arrive at the output layer,
where the function (or functions) predict the response Y similarly to the single-layer output
function above, but where f(X) is a linear combination of the activations in the last hidden layer.
In the case of a classification or qualitative response, we can have as many output functions as
there are classes, and thus
our work is not yet done. In this case, our estimates must represent class probabilities,
and we do this using the special softmax activation function

$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{l=1}^{L} e^{Z_l}},$$

where Z_m is given by

$$Z_m = \beta_{m0} + \sum_{k=1}^{K_2} \beta_{mk} A_k^{(2)},$$

and where the A_k^{(2)} are the activations in the last hidden layer for k = 1, ..., K_2. This softmax
activation has similarities to logistic regression in that our outputs now behave like
probabilities: they are non-negative and sum to one. Class assignment is accomplished by
assigning images to the class with the highest softmax activation value. Additionally, to train our
network for a categorical response we compute coefficients that minimize categorical cross-
entropy, the negative multinomial log-likelihood given by

$$-\sum_{i=1}^{n} \sum_{m=1}^{L} y_{im} \log\left(f_m(x_i)\right),$$

where y_{im} equals one if observation i belongs to class m and zero otherwise.
For regression, multilayer networks simply minimize squared-error loss as discussed for
single-layer regression networks. It should be noted that these networks require a very large number
of coefficients, and we also must work with very large training sets so there are enough patterns for the
model to learn. To avoid overfitting, we often need to use forms of regularization such as ridge
regularization and dropout regularization which are analogous to regularization seen elsewhere
in the textbook.
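The softmax and cross-entropy calculations above are easy to verify numerically. The short sketch below, with made-up values of Z_m for L = 3 classes, shows that the softmax outputs behave like probabilities and computes the corresponding cross-entropy term for one observation.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs are non-negative and sum to one
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_cross_entropy(y_onehot, probs):
    # Negative multinomial log-likelihood contribution for a single observation
    return -np.sum(y_onehot * np.log(probs))

z = np.array([2.0, 0.5, -1.0])   # Z_m for L = 3 classes (illustrative values)
probs = softmax(z)
y = np.array([1.0, 0.0, 0.0])    # the true class is the first one
print(probs, probs.sum())         # class probabilities summing to 1
print(categorical_cross_entropy(y, probs))
```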
Convolutional Neural Networks
CNNs are designed to mimic how humans classify images, by recognizing specific
features or patterns anywhere in the image that distinguish each particular object class. CNNs
first identify low-level features in the input image, such as small edges and patches of color, and
these are then combined to form higher-level features such as trees, roads, buildings, or anything
else relevant in the training images. The accumulation of context and the presence of distinguishing
higher-level features of the image eventually contribute to the probability of an image being
assigned to a particular class.
Two types of hidden layers called convolution layers and pooling layers make up CNNs.
Convolution layers search for the presence of small patterns in the image, and pooling layers
perform dimension reduction, selecting a subset of values that summarizes how strongly each
filter's pattern is present. Deep CNNs make use of many convolutional and pooling layers carefully
designed with regularization methods between layers.
A convolutional layer is composed of a large number of convolutional filters, each of
which is a template that checks for the presence of a specific local feature or pattern. In the case
of CNNs applied to two-dimensional image classification, an image is convolved when it undergoes
a convolution operation with a particular convolution filter, which is a small array of weights, as demonstrated
below.
Figure 1.2: Convolution of an image with a small filter of weights.
If a submatrix of the original image resembles the convolution filter, then the algebra of the
convolution will grant this region a large value in the convolved image and otherwise regions
will receive small values. The convolved image thus highlights subregions of the original image
which resemble the convolution filter. The filters are learned for the specific classification task in
CNNs, and the filter weights (the entries of the filter array in Figure 1.2) are the parameters going
from an input layer to a hidden layer, with one hidden unit for each pixel in the convolved image.
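As a small illustration of the convolution operation described above, the following sketch slides a hypothetical 2x2 filter over a toy image and records the elementwise product sums; submatrices that resemble the filter receive large values in the convolved image.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # "Valid" convolution: slide the filter over every submatrix of the image and
    # record the sum of elementwise products; large values mean the submatrix
    # resembles the filter. (Strictly this is cross-correlation, the operation
    # CNN layers actually compute, conventionally called convolution.)
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.array([[1, 2, 5, 3],
                  [8, 0, 1, 2],
                  [3, 7, 9, 4],
                  [2, 1, 0, 6]], dtype=float)
vertical_edge = np.array([[1, -1],
                          [1, -1]], dtype=float)   # a hypothetical 2x2 filter
print(convolve2d_valid(image, vertical_edge))
```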
In the case of RGB image analysis, a single convolution filter has three channels, one
two-dimensional array of weights for each color, with potentially different filter
weights for each channel. The three convolved images are summed to form a single two-dimensional
output feature map, and applying K_i different filters in this way yields K_i channels after the ith hidden layer.
The second type of hidden layer in a CNN is the pooling layer, which reduces the
dimensions of convolved images to provide lower-dimensional summaries to pass to subsequent
convolution or flattening layers. This is commonly done through max pooling illustrated in
Figure 1.3.
Figure 1.3: Max pooling of a feature map.
Because convolution layers return images with high values indicating feature detection, max
pooling summarizes the presence of feature detection while balancing location information and
location invariance. After some sequence of convolutions and pooling, we flatten the feature
maps to pass pixels individually to a fully connected layer and then to the softmax activation
output layer.
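The max pooling operation of Figure 1.3 can be summarized in a few lines; the sketch below condenses each non-overlapping 2x2 block of a toy feature map to its maximum value.

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Non-overlapping max pooling: keep the largest value in each size x size block,
    # summarizing whether the feature was detected anywhere in that block.
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    blocks = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 2, 5, 3],
               [3, 0, 1, 2],
               [2, 1, 3, 4],
               [1, 1, 2, 0]], dtype=float)
print(max_pool(fm))   # -> [[3. 5.], [2. 4.]]
```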
U-Net is a particularly powerful CNN architecture that, instead of outputting a
classification or a numerical variable, outputs a semantically segmented image of
equal or similar dimensions to the input image. In addition to the sequence of convolution and
pooling layers that reduces spatial dimension while building up higher-level features, U-Net
reverses this direction, upsampling back toward the input resolution and predicting a segmentation
mask based on the locations of the features that have been detected and classified. An in-depth
examination of the U-Net architecture is beyond the scope of this introduction; the original paper
(Ronneberger et al.) details the theory and approach of the model.
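For readers who want a sense of how the encoder-decoder structure fits together in code, the sketch below builds a deliberately tiny U-Net-style network in Keras; the layer counts, filter sizes, and input shape are illustrative placeholders and do not reproduce the architecture from the original paper or the model trained later in this project.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(256, 256, 3), n_classes=6):
    # Encoder: convolutions extract features, max pooling reduces resolution
    inputs = layers.Input(shape=input_shape)
    c1 = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(32, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck
    b = layers.Conv2D(64, 3, activation="relu", padding="same")(p2)

    # Decoder: upsample back to the input resolution, concatenating the matching
    # encoder feature maps (the skip connections that recover fine spatial detail)
    u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
    u2 = layers.concatenate([u2, c2])
    c3 = layers.Conv2D(32, 3, activation="relu", padding="same")(u2)
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
    u1 = layers.concatenate([u1, c1])
    c4 = layers.Conv2D(16, 3, activation="relu", padding="same")(u1)

    # Per-pixel softmax over the segmentation classes
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c4)
    return Model(inputs, outputs)

model = tiny_unet()
model.summary()
```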
Methods
After researching the pros and cons of different statistical models for the computer vision
task of semantic image segmentation, we chose to follow a basic CNN for image classification
tutorial in the second edition of “An Introduction to Statistical Learning with Applications in R.”
This tutorial was fairly straightforward and demonstrated the power of a simple CNN. After
following the installation instructions (“Installation…”), we built a CNN with four pairs of two-
dimensional convolution and pooling layers with 32, 64, 128, and 256 convolutional filters,
ReLU activation, a convolutional kernel size of 3x3, and 2x2 max pooling. We experimented
with the architecture and settled on a model with three layers and a higher dropout rate of 0.65 for
best results. For complete code, see the appendix.
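The textbook lab is written in R with the keras interface; for consistency with the Python code used in the rest of this project, the following sketch approximates the architecture described above in Keras for Python. The input shape, dense layer width, and optimizer are assumptions, not the exact settings from our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input size and class count are placeholders (the ISLR lab uses 32x32 CIFAR-100 images)
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dropout(0.65),                      # the higher dropout rate noted above
    layers.Dense(512, activation="relu"),      # assumed dense layer width
    layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```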
This guided investigation gave us more understanding and confidence to tackle a more
challenging semantic segmentation task. Through more research, we came across a
Kaggle.com competition dataset for semantic segmentation containing 66 JPG satellite
images of Dubai with aerial perspectives of roads, land, buildings, vegetation, and water.
All 66 images were accompanied by a hand-labeled PNG mask image consisting of color
segmentations into 6 classes: roads, land, buildings, vegetation, water, and unlabeled. Though
small, this dataset seemed like a great place to start in our exploration of semantic segmentation
(Humans in The Loop).
Once familiar with the data we would attempt to conduct statistical learning on, we found
a repository with Python code implementing a U-Net model in TensorFlow. Following the
framework laid out in the repository, we began implementing the TensorFlow U-Net model in
our own Python class. From there, we were ready to train the model, tweak hyperparameters, and
predict on unseen data. The one final challenge in our way was preprocessing: the images from
the dataset would need to be reduced to matrix representations that our U-Net model could
understand. This preprocessing involved many steps.
We began by reading each image from disk with the “os” package and cropping it to
dimensions divisible by the patch size. Using the NumPy and Patchify packages, we broke up each image into a set
of smaller images that we would train the model on. We then repeated this process for the PNG
mask images. The result of this step was to populate a new matrix of cleaned and “patchified”
training feature images and a new matrix of cleaned and “patchified” training label images.
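A minimal sketch of this patchifying step is shown below; the directory layout, the patch size of 256, and the simple rescaling to [0, 1] are assumptions for illustration rather than the exact preprocessing used in our repository.

```python
import os
import cv2
import numpy as np
from patchify import patchify

PATCH = 256                   # hypothetical patch size
image_dir = "data/images"     # placeholder path, not the actual repository layout

patches_out = []
for name in sorted(os.listdir(image_dir)):
    img = cv2.imread(os.path.join(image_dir, name))
    if img is None:
        continue
    # Crop so both dimensions are divisible by the patch size
    h = (img.shape[0] // PATCH) * PATCH
    w = (img.shape[1] // PATCH) * PATCH
    img = img[:h, :w, :]
    # Break the image into non-overlapping PATCH x PATCH tiles
    tiles = patchify(img, (PATCH, PATCH, 3), step=PATCH)
    patches_out.append(tiles.reshape(-1, PATCH, PATCH, 3))

X_train = np.concatenate(patches_out) / 255.0   # simple rescaling to [0, 1]
print(X_train.shape)
```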
From there, we were able to make a new instance of our U-Net model, pass it
hyperparameters such as the number of epochs, batch size, and validation criteria, and then run the fit
function. For each epoch, our model would print the time it took to run, training loss, training
accuracy, Jaccard coefficient, validation loss, validation accuracy, and validation Jaccard
coefficient, to give us a sense of how the model was performing as it received more training.
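A sketch of this step is shown below, assuming the model and the patchified training arrays from the previous sketches; the Jaccard (IoU) metric is written as a custom Keras metric, and the batch size and validation split shown here are illustrative rather than our exact settings.

```python
from tensorflow.keras import backend as K

def jaccard_coef(y_true, y_pred, smooth=1.0):
    # Intersection over union computed on the flattened one-hot masks
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    union = K.sum(y_true_f) + K.sum(y_pred_f) - intersection
    return (intersection + smooth) / (union + smooth)

# `model`, `X_train`, and `y_train` (one-hot encoded masks) are assumed to come
# from the earlier sketches; batch size and validation split are illustrative.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", jaccard_coef])
history = model.fit(X_train, y_train,
                    batch_size=16,
                    epochs=10,
                    validation_split=0.2)
```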
Each epoch on a 2.7 GHz quad-core CPU took roughly 329 seconds. Performing the training on a GPU
would vastly speed up the training time.
After the model finished training, we used TensorFlow's built-in save function to store a
file representation of the model on our machine so that we could easily load it for future
predictions without having to retrain the model. Lastly, our script would display graphs of
the training and validation loss over time as the model trained, and some example predictions on
some of the training data.
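Saving and reloading in Keras can be sketched as follows; the filename is a placeholder, and the custom Jaccard metric from the previous sketch must be passed back in when the model is reloaded.

```python
import tensorflow as tf

# Save the trained model so later predictions do not require retraining
model.save("unet_dubai.h5")   # hypothetical filename

# ...later, reload it; custom metrics such as jaccard_coef must be supplied again
reloaded = tf.keras.models.load_model(
    "unet_dubai.h5",
    custom_objects={"jaccard_coef": jaccard_coef})
```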
To encapsulate the project, we developed a simple Python GUI application with Tkinter
which allows users to take a custom screenshot of their screen and then pipe the
image directly into our model. After roughly 30 seconds of processing, the application displays the prediction to
the user and offers to save the images to disk.
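A stripped-down sketch of the screenshot-to-prediction flow is shown below, assuming the reloaded model from the previous sketch; the real application adds custom region selection, patch smoothing, and saving, so this is only an outline of the idea.

```python
import numpy as np
import tkinter as tk
from PIL import ImageGrab   # screenshot capture (Windows and macOS)

def capture_and_predict(model, size=256):
    # Grab the screen, resize to the model's input size, and predict a mask
    shot = ImageGrab.grab().resize((size, size))
    x = np.asarray(shot)[..., :3] / 255.0         # drop any alpha channel, rescale
    probs = model.predict(x[np.newaxis, ...])[0]  # per-pixel class probabilities
    return np.argmax(probs, axis=-1)              # predicted class for each pixel

# `reloaded` is the model loaded in the previous sketch
root = tk.Tk()
tk.Button(root, text="Screenshot and segment",
          command=lambda: print(capture_and_predict(reloaded))).pack()
root.mainloop()
```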
Results and Data Analysis
We trained four main U-Net models during the course of our investigation. Each model
shared the same hyperparameters except for the number of epochs: one model ran for 2 epochs, another for
5, another for 10, and another for 100.

Our 100-epoch model outperformed all others, with training IoU approaching 0.9 and
validation IoU exceeding 0.7. The 10-epoch model had both training and validation IoU
approaching 0.6. The 5-epoch model had both training and validation IoU approaching 0.55. We
did not capture IoU for our 2-epoch model. See the figures below.
We were not able to capture the loss for our 100-epoch model or our 2-epoch model, though we
did collect loss graphs for the other two models. The 10-epoch model had both training and
validation loss approaching 0.91. The 5-epoch model had both training and validation loss
approaching 0.93. See the figures below.
Discussion
The U-Net model is certainly an interesting technology, but our implementation is far from perfect.
The challenge behind a project like this is to spend time and thought dialing in hyperparameters
and tweaking the model to produce better results. We concluded that even though our metrics for
model accuracy were increasing with more training epochs, our model was perhaps overfitting.
We came to this conclusion by visually inspecting our model's performance on unseen
data that did not come from the original dataset. Our hope was to develop a robust, generic aerial
image classifier with the ability to semantically segment any type of aerial image. Our model
performs very well on the specific Dubai image dataset, but the 100-epoch model performs
poorly on other satellite imagery.
After thinking about this problem, we chose to implement our 10-epoch model for our
application because our graphs of IoU and loss displayed clear diminishing returns around the 10-
epoch mark. With many more than 10 epochs, we expect to see the beginnings of overfitting.
The GUI application is in its early phases but provides a quick and easy way to demonstrate our
model in a somewhat real-world fashion. Unfortunately, it does take some time to
render a prediction once the user has taken a screenshot; the delay is caused by the run time of a
smoothing algorithm that aims to produce a better visual prediction and is computationally expensive.
Improving on our work so far would require more hours dedicated to tweaking the hyperparameters
of the U-Net model. Additionally, since the model is trained only on the Dubai dataset, its
predictive power on unseen data unlike the training set is not remarkable. Training the model on additional data
would certainly improve this power. However, hand-labeling image segmentations is a challenge
on its own. Perhaps developing a hybrid CNN model that can be trained in both a supervised and an
unsupervised manner would be an interesting future project. At the very least, we hope the GUI
application alongside the git repository will inspire others to learn more about and tackle difficult
computer vision tasks.
Works Cited
“73 - Image Segmentation Using U-Net - Part 1.” YouTube,
https://www.youtube.com/watch?v=azM57JuQpQI.
Bnsreenu. “python_for_microscopists.” GitHub,
https://github.com/bnsreenu/python_for_microscopists.
“Fully Convolutional Networks for Semantic Segmentation.” IEEE Xplore,
https://ieeexplore.ieee.org/document/7478072.
Garg, Rajat, et al. “Semantic Segmentation of POLSAR Image Data Using Advanced Deep
Learning Model.” Nature News, Nature Publishing Group, 28 July 2021,
https://www.nature.com/articles/s41598-021-94422-y#Tab1.
“An Introduction to Statistical Learning.” An Introduction to Statistical Learning,
https://www.statlearning.com/.
“Installation Guide: Python, Reticulate, and Keras” https://hastie.su.domains/ISLR2/keras-
instructions.html
“How to Train an SVM Classifier on a Satellite Image Using Python.” Stack Overflow,
https://stackoverflow.com/questions/43331510/how-to-train-an-svm-classifier-on-a-satellite-
image-using-python.
Kakarla, Syam. “Hyperspectral Image Analysis - Getting Started.” Medium, Towards Data
Science, 7 Apr. 2021, https://towardsdatascience.com/hyperspectral-image-analysis-getting-
started-74758c12f2e9.
Kakarla, Syam. “Land Cover Classification of Satellite Imagery Using Convolutional Neural
Networks.” Medium, Towards Data Science, 3 Jan. 2021, https://towardsdatascience.com/land-
cover-classification-of-satellite-imagery-using-convolutional-neural-networks-91b5bb7fe808.
Humans in the Loop. “Semantic Segmentation of Aerial Imagery.” Kaggle, 29 May 2020,
https://www.kaggle.com/datasets/humansintheloop/semantic-segmentation-of-aerial-imagery.
Supervised Classification of Radarsat-2 ... - Arxiv. https://arxiv.org/pdf/1608.00501.pdf.
Ronneberger, Olaf, et al. “U-Net: Convolutional Networks for Biomedical Image Segmentation.”
https://www.semanticscholar.org/paper/U-Net%3A-Convolutional-Networks-for-Biomedical-
Image-Ronneberger-Fischer/6364fdaa0a0eccd823a779fcdd489173f938e91a.
Vooban. “Vooban/Smoothly-Blend-Image-Patches: Using a U-Net for Image Segmentation,
Blending Predicted Patches Smoothly Is a Must to Please the Human Eye.” GitHub,
https://github.com/Vooban/Smoothly-Blend-Image-Patches.
Appendix
See following pages for code screenshots.
See https://github.com/andrewcolepinkham/semanticsegmentation for full code repository.