This notebook contains edits from Coursera Deep Learning Specialization
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets have the ability to learn these filters/characteristics.
A ConvNet is able to successfully capture the Spatial and Temporal dependencies in an image through the application of relevant filters. The architecture performs a better fitting to the image dataset due to the reduction in the number of parameters involved and reusability of weights. In other words, the network can be trained to understand the sophistication of the image better.
You can imagine how computationally intensive things would get once the images reach dimensions, say 8K (7680×4320). The role of the ConvNet is to reduce the images into a form which is easier to process, without losing features which are critical for getting a good prediction. This is important when we are to design an architecture which is not only good at learning features but also is scalable to massive datasets.
Edge detection is a technique of image processing used to identify points in a digital image with discontinuities, simply to say, sharp changes in the image brightness. These points where the image brightness varies sharply are called the edges (or boundaries) of the image.
It is one of the basic steps in image processing, pattern recognition in images and computer vision. When we process very high-resolution digital images, convolution techniques come to our rescue. Let us understand the convolution operation (represented in the below image using *) using an example-
Padding is a term relevant to convolutional neural networks as it refers to the amount of pixels added to an image when it is being processed by the kernel of a CNN. For example, if the padding in a CNN is set to zero, then every pixel value that is added will be of value zero.
For a gray scale (n x n) image and (f x f) filter/kernel, the dimensions of the image resulting from a convolution operation is (n – f + 1) x (n – f + 1).
For example, for an (8 x 8) image and (3 x 3) filter, the output resulting after convolution operation would be of size (6 x 6). Thus, the image shrinks every time a convolution operation is performed. This places an upper limit to the number of times such an operation could be performed before the image reduces to nothing thereby precluding us from building deeper networks.
Padding is simply a process of adding layers of zeros to our input images so as to avoid the problems mentioned above.
This prevents shrinking as, if p = number of layers of zeros added to the border of the image, then our (n x n) image becomes (n + 2p) x (n + 2p) image after padding. So, applying convolution-operation (with (f x f) filter) outputs (n + 2p – f + 1) x (n + 2p – f + 1) images. For example, adding one layer of padding to an (8 x 8) image and using a (3 x 3) filter we would get an (8 x 8) output after performing convolution operation.
In plain English, a stride is a step that you take when walking or running.
If you want to convolve, let’s say, a7 by 7 image with this 3 by 3 filter with a stride of two, what that means is you take the element Y’s product as usual in this upper left 3 by 3region and then multiply and add (that gives you 91 as the first element of the output). See figure below.
The pooling operation involves sliding a two-dimensional filter over each channel of feature map and summarising the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of output obtained after a pooling layer is
(nh - f + 1) / s x (nw - f + 1)/s x nc
-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of filter
-> s - stride length
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
Average pooling computes the average of the elements present in the region of feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of features present in a patch.
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
With the basics covered, it's time to explore how we can use TensorFlow to build a Convolutional neural network.
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
from matplotlib.pyplot import imread
import scipy
from PIL import Image
import pandas as pd
import tensorflow as tf
import tensorflow.keras.layers as tfl
from tensorflow.python.framework import ops
from cnn_utils import *
from test_utils import summary, comparator
%matplotlib inline
np.random.seed(1)
You'll be using the Happy House dataset for this part of the assignment, which contains images of peoples' faces. Your task will be to build a ConvNet that determines whether the people in the images are smiling or not -- because they only get to enter the house if they're smiling!
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_happy_dataset()
# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.
# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T
print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
You can display the images contained in the dataset. Images are 64x64 pixels in RGB format (3 channels).
index = 125
plt.imshow(X_train_orig[index]) #display sample training image
plt.show()
In TF Keras, you don't have to write code directly to create layers. Rather, TF Keras has pre-defined layers you can use.
When you create a layer in TF Keras, you are creating a function that takes some input and transforms it into an output you can reuse later.
Most practical applications of deep learning today are built using programming frameworks, which have many built-in functions you can simply call. Keras is a high-level abstraction built on top of TensorFlow, which allows for even more simplified and optimized model creation and training.
For the first part of this class, We shall create a model using TF Keras' Sequential API, which allows you to build layer by layer, and is ideal for building models where each layer has exactly one input tensor and one output tensor.
As you'll see, using the Sequential API is simple and straightforward, but is only appropriate for simpler, more straightforward tasks. Later in this notebook we shall spend some time building with a more flexible, powerful alternative: the Functional API.
As mentioned earlier, the TensorFlow Keras Sequential API can be used to build simple models with layer operations that proceed in a sequential order.
You can also add layers incrementally to a Sequential model with the .add() method, or remove them using the .pop() method, much like you would in a regular Python list.
Actually, you can think of a Sequential model as behaving like a list of layers. Like Python lists, Sequential layers are ordered, and the order in which they are specified matters. If your model is non-linear or contains layers with multiple inputs or outputs, a Sequential model wouldn't be the right choice!
For any layer construction in Keras, you'll need to specify the input shape in advance. This is because in Keras, the shape of the weights is based on the shape of the inputs. The weights are only created when the model first sees some input data. Sequential models can be created by passing a list of layers to the Sequential constructor.
Implement the happyModel
function below to build the following model: ZEROPAD2D -> CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> FLATTEN -> DENSE
. Take help from tf.keras.layers
Also, plug in the following parameters for all the steps:
Hint:
Use tfl as shorthand for tensorflow.keras.layers
def happyModel():
"""
Implements the forward propagation for the binary classification model:
ZEROPAD2D -> CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> FLATTEN -> DENSE
Note that for simplicity and grading purposes, you'll hard-code all the values
such as the stride and kernel (filter) sizes.
Normally, functions should take these values as function parameters.
Arguments:
None
Returns:
model -- TF Keras model (object containing the information for the entire training process)
"""
model = tf.keras.Sequential([
tfl.ZeroPadding2D(padding=(3,3), input_shape = (64,64,3)),
tfl.Conv2D(32, (7,7)),
tfl.BatchNormalization(axis=3),
tfl.ReLU(),
tfl.MaxPooling2D(),
tfl.Flatten(),
tfl.Dense(1, activation= 'sigmoid')
])
return model
happy_model = happyModel()
# Print a summary for each layer
for layer in summary(happy_model):
print(layer)
output = [['ZeroPadding2D', (None, 70, 70, 3), 0, ((3, 3), (3, 3))],
['Conv2D', (None, 64, 64, 32), 4736, 'valid', 'linear', 'GlorotUniform'],
['BatchNormalization', (None, 64, 64, 32), 128],
['ReLU', (None, 64, 64, 32), 0],
['MaxPooling2D', (None, 32, 32, 32), 0, (2, 2), (2, 2), 'valid'],
['Flatten', (None, 32768), 0],
['Dense', (None, 1), 32769, 'sigmoid']]
comparator(summary(happy_model), output)
Now that your model is created, you can compile it for training with an optimizer and loss of your choice. When the string accuracy
is specified as a metric, the type of accuracy used will be automatically converted based on the loss function used. This is one of the many optimizations built into TensorFlow that make your life easier! If you'd like to read more on how the compiler operates, check the docs here.
happy_model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
It's time to check your model's parameters with the .summary()
method. This will display the types of layers you have, the shape of the outputs, and how many parameters are in each layer.
happy_model.summary()
After creating the model, compiling it with your choice of optimizer and loss function, and doing a sanity check on its contents, you are now ready to build!
Simply call .fit()
to train. That's it! No need for mini-batching, saving, or complex backpropagation computations. That's all been done for you, as you're using a TensorFlow dataset with the batches specified already. You do have the option to specify epoch number or minibatch size if you like (for example, in the case of an un-batched dataset).
happy_model.fit(X_train, Y_train, epochs=10, batch_size=16)
After that completes, just use .evaluate()
to evaluate against your test set. This function will print the value of the loss function and the performance metrics specified during the compilation of the model. In this case, the binary_crossentropy
and the accuracy
respectively.
happy_model.evaluate(X_test, Y_test)
Easy, right? But what if you need to build a model with shared layers, branches, or multiple inputs and outputs? This is where Sequential, with its beautifully simple yet limited functionality, won't be able to help you.
Next up: Enter the Functional API, your slightly more complex, highly flexible friend.
Here we shall use Keras' flexible Functional API to build a ConvNet that can differentiate between 6 sign language digits.
The Functional API can handle models with non-linear topology, shared layers, as well as layers with multiple inputs or outputs. Imagine that, where the Sequential API requires the model to move in a linear fashion through its layers, the Functional API allows much more flexibility. Where Sequential is a straight line, a Functional model is a graph, where the nodes of the layers can connect in many more ways than one.
In the visual example below, the one possible direction of the movement Sequential model is shown in contrast to a skip connection, which is just one of the many ways a Functional model can be constructed. A skip connection, as you might have guessed, skips some layer in the network and feeds the output to a later layer in the network.
As a reminder, the SIGNS dataset is a collection of 6 signs representing numbers from 0 to 5.
# Loading the data (signs)
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_signs_dataset()
The next cell will show you an example of a labelled image in the dataset. Feel free to change the value of index
below and re-run to see different examples.
# Example of an image from the dataset
index = 9
plt.imshow(X_train_orig[index])
print ("y = " + str(np.squeeze(Y_train_orig[:, index])))
In Course 2, you built a fully-connected network for this dataset. But since this is an image dataset, it is more natural to apply a ConvNet to it.
To get started, let's examine the shapes of your data.
X_train = X_train_orig/255.
X_test = X_test_orig/255.
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T
print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
In TensorFlow, there are built-in functions that implement the convolution steps for you. By now, you should be familiar with how TensorFlow builds computational graphs. In the Functional API, you create a graph of layers. This is what allows such great flexibility.
However, the following model could also be defined using the Sequential API since the information flow is on a single line. But don't deviate. What we want you to learn is to use the functional API.
Begin building your graph of layers by creating an input node that functions as a callable object:
Then, create a new node in the graph of layers by calling a layer on the input_img
object:
tf.keras.layers.Conv2D(filters= ... , kernel_size= ... , padding='same')(input_img): Read the full documentation on Conv2D.
tf.keras.layers.MaxPool2D(pool_size=(f, f), strides=(s, s), padding='same'): MaxPool2D()
downsamples your input using a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window. For max pooling, you usually operate on a single example at a time and a single channel at a time. Read the full documentation on MaxPool2D.
tf.keras.layers.ReLU(): computes the elementwise ReLU of Z (which can be any shape). You can read the full documentation on ReLU.
tf.keras.layers.Flatten(): given a tensor "P", this function takes each training (or test) example in the batch and flattens it into a 1D vector.
If a tensor P has the shape (batch_size,h,w,c), it returns a flattened tensor with shape (batch_size, k), where . "k" equals the product of all the dimension sizes other than the first dimension.
For example, given a tensor with dimensions [100, 2, 3, 4], it flattens the tensor to be of shape [100, 24], where 24 = 2 * 3 * 4. You can read the full documentation on Flatten.
tf.keras.layers.Dense(units= ... , activation='softmax')(F): given the flattened input F, it returns the output computed using a fully connected layer. You can read the full documentation on Dense.
In the last function above (tf.keras.layers.Dense()
), the fully connected layer automatically initializes weights in the graph and keeps on training them as you train the model. Hence, you did not need to initialize those weights when initializing the parameters.
Lastly, before creating the model, you'll need to define the output using the last of the function's compositions (in this example, a Dense layer):
The words "kernel" and "filter" are used to refer to the same thing. The word "filter" accounts for the amount of "kernels" that will be used in a single convolution layer. "Pool" is the name of the operation that takes the max or average value of the kernels.
This is why the parameter pool_size
refers to kernel_size
, and you use (f,f)
to refer to the filter size.
Pool size and kernel size refer to the same thing in different objects - They refer to the shape of the window where the operation takes place.
Implement the convolutional_model
function below to build the following model: CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> DENSE
. Use the functions above!
Also, plug in the following parameters for all the steps:
def convolutional_model(input_shape):
"""
Implements the forward propagation for the model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> DENSE
Note that for simplicity and grading purposes, you'll hard-code some values
such as the stride and kernel (filter) sizes.
Normally, functions should take these values as function parameters.
Arguments:
input_img -- input dataset, of shape (input_shape)
Returns:
model -- TF Keras model (object containing the information for the entire training process)
"""
input_img = tf.keras.Input(shape=input_shape)
Z1 = tfl.Conv2D(8, (4,4), padding='same')(input_img)
A1 = tfl.ReLU()(Z1)
P1 = tfl.MaxPool2D(pool_size=(8,8), strides=(8,8), padding = 'same')(A1)
Z2 = tfl.Conv2D(16, (2,2), padding='same')(P1)
A2 = tfl.ReLU()(Z2)
P2 = tfl.MaxPool2D(pool_size=(4,4), strides=(4,4), padding='same')(A2)
F = tfl.Flatten()(P2)
outputs = tfl.Dense(6, activation = 'softmax')(F)
model = tf.keras.Model(inputs=input_img, outputs=outputs)
return model
conv_model = convolutional_model((64, 64, 3))
conv_model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
conv_model.summary()
output = [['InputLayer', [(None, 64, 64, 3)], 0],
['Conv2D', (None, 64, 64, 8), 392, 'same', 'linear', 'GlorotUniform'],
['ReLU', (None, 64, 64, 8), 0],
['MaxPooling2D', (None, 8, 8, 8), 0, (8, 8), (8, 8), 'same'],
['Conv2D', (None, 8, 8, 16), 528, 'same', 'linear', 'GlorotUniform'],
['ReLU', (None, 8, 8, 16), 0],
['MaxPooling2D', (None, 2, 2, 16), 0, (4, 4), (4, 4), 'same'],
['Flatten', (None, 64), 0],
['Dense', (None, 6), 390, 'softmax']]
comparator(summary(conv_model), output)
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test)).batch(64)
history = conv_model.fit(train_dataset, epochs=100, validation_data=test_dataset)
The history object is an output of the .fit()
operation, and provides a record of all the loss and metric values in memory. It's stored as a dictionary that you can retrieve at history.history
:
history.history
Now visualize the loss over time using history.history
:
# The history.history["loss"] entry is a dictionary with as many values as epochs that the
# model was trained on.
df_loss_acc = pd.DataFrame(history.history)
df_loss= df_loss_acc[['loss','val_loss']]
df_loss.rename(columns={'loss':'train','val_loss':'validation'},inplace=True)
df_acc= df_loss_acc[['accuracy','val_accuracy']]
df_acc.rename(columns={'accuracy':'train','val_accuracy':'validation'},inplace=True)
df_loss.plot(title='Model loss',figsize=(12,8)).set(xlabel='Epoch',ylabel='Loss')
df_acc.plot(title='Model Accuracy',figsize=(12,8)).set(xlabel='Epoch',ylabel='Accuracy')
Congratulations! You've built two models: One that recognizes smiles, and another that recognizes SIGN language with almost 80% accuracy on the test set. In addition to that, you now also understand the applications of two Keras APIs: Sequential and Functional. Nicely done!
You're always encouraged to read the official documentation. To that end, you can find the docs for the Sequential and Functional APIs here: