features we have detected and classified. An in-depth examination of the UNet architecture is
beyond the scope of this introduction; the original paper (“Fully Convolutional Network”)
details the theory and approach of the model.
Methods
After researching the pros and cons of different statistical models for the computer vision
task of semantic image segmentation, we chose to follow a basic CNN image-classification
tutorial in the second edition of “An Introduction to Statistical Learning with Applications in R.”
This tutorial was fairly straightforward and demonstrated the power of a simple CNN. After
following the installation instructions (“Installation…”), we built a CNN with four pairs of two-
dimensional convolution and pooling layers with 32, 64, 128, and 256 convolutional filters,
ReLU activation, a 3 × 3 convolutional kernel, and 2 × 2 max pooling. After experimenting
with the architecture, we settled on a model with three layer pairs and a higher dropout rate of 0.65 for
best results. For complete code, see page 19 of the appendix.
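A minimal sketch of the architecture described above, using the Keras API. The input shape, class count, optimizer, and loss are assumptions for illustration, since the text does not state them; the filter counts, 3 × 3 kernels, 2 × 2 max pooling, ReLU activation, and dropout rate follow the description.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(32, 32, 3), n_classes=10, dropout=0.65):
    # Assumed: 32x32 RGB input and 10 output classes (not stated in the text).
    model = keras.Sequential([layers.Input(shape=input_shape)])
    # Four pairs of 2-D convolution and max pooling, as described above.
    for filters in (32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(dropout))  # the higher dropout rate we settled on
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Dropping to three convolution–pooling pairs, as in our final model, amounts to shortening the filter tuple in the loop.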
This guided investigation gave us more understanding and confidence to tackle a more
challenging semantic segmentation task. Through further research, we came across a
Kaggle.com competition dataset for semantic segmentation that included 66 JPG satellite
images of Dubai showing aerial perspectives of roads, land, buildings, vegetation, and water.
Each of the 66 images was accompanied by a hand-labeled PNG mask image segmenting it by
color into six classes: roads, land, buildings, vegetation, water, and unlabeled. Though
small, this dataset seemed like a great place to start our exploration of semantic segmentation
(Humans in The Loop).
Once familiar with the data we would attempt to conduct statistical learning on, we found
a repository with Python code that implemented a UNET model in TensorFlow. Following the
framework laid out in the repository, we implemented the TensorFlow UNET model in
our own Python class. From there, we were ready to train the model, tweak hyperparameters, and
predict on unseen data. The one remaining challenge was pre-processing: the images from
the dataset needed to be reduced to matrix representations that our UNET model could
understand. This pre-processing involved several steps.
We began by reading each image with the “os” package and cropping it to dimensions
divisible by the patch size. Using the NumPy and Patchify packages, we broke each image into a set
of smaller patches that we would train the model on, then repeated this process for the PNG
mask images. The result of this step was a new matrix of cleaned, “patchified”
training feature images and a corresponding matrix of cleaned, “patchified” training label images.
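The cropping and patching steps above can be sketched in plain NumPy. Our actual code used the Patchify package; this is an equivalent non-overlapping tiling written out by hand, with hypothetical helper names and an assumed H × W × C image layout.

```python
import numpy as np

def crop_to_divisible(img, patch):
    # Trim height and width down to the nearest multiple of the patch size.
    h, w = img.shape[:2]
    return img[: (h // patch) * patch, : (w // patch) * patch]

def patchify_2d(img, patch):
    # Split an H x W x C image into non-overlapping patch x patch tiles,
    # returned as an (n_patches, patch, patch, C) array.
    img = crop_to_divisible(img, patch)
    h, w = img.shape[:2]
    tiles = img.reshape(h // patch, patch, w // patch, patch, -1)
    return tiles.swapaxes(1, 2).reshape(-1, patch, patch, img.shape[2])
```

For example, a 1000 × 700 RGB image cropped and tiled with a 256-pixel patch size yields six 256 × 256 × 3 patches; applying the same function to the PNG masks keeps features and labels aligned patch for patch.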
From there, we were able to create a new instance of our UNET model, pass it
hyperparameters such as the number of epochs, batch size, and validation criteria, and run the fit
function. For each epoch, our model printed the time it took to run, training loss, training
accuracy, Jaccard coefficient, validation loss, validation accuracy, and validation Jaccard
coefficient, giving us a sense of how the model performed as it received more training.
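The Jaccard coefficient reported each epoch measures the overlap between predicted and true segmentation masks (intersection over union). A minimal NumPy sketch of the computation, with an assumed smoothing term to avoid division by zero on empty masks:

```python
import numpy as np

def jaccard_coef(y_true, y_pred, smooth=1.0):
    # Intersection over union of flattened mask arrays.
    # `smooth` is an assumed additive constant so an empty union
    # does not divide by zero.
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)
```

In the Keras setting, the same formula can be registered as a custom metric (swapping the NumPy sums for backend tensor sums) so that it is printed alongside loss and accuracy for both the training and validation sets each epoch.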