Custom AI Face Recognition With Keras and CNN

Arnaldo P. Castaño

5.00/5 (4 votes)

Aug 5, 2020

CPOL

7 min read

20265

586

In this article we’ll put together our CNN and train it for face recognition.

Download source - 8.4 KB

Introduction

If you’ve seen the Minority Report movie, you probably remember the scene where Tom Cruise walks into a Gap store. A retinal scanner reads his eyes, and plays a customized ad for him. Well, this is 2020. We don’t need retinal scanners, because we have Artificial Intelligence (AI) and Machine Learning (ML)!

In this series, we’ll show you how to use Deep Learning to perform facial recognition, and then – based on the face that was recognized – use a Neural Network Text-to-Speech (TTS) engine to play a customized ad. You are welcome to browse the code here on CodeProject or download the .zip file to browse the code on your own machine.

We assume that you are familiar with the basic concepts of AI/ML, and that you can find your way around Python.

In this article, we’ll discuss CNNs, then design one and implement it in Python using Keras.

What is a CNN?

A CNN is a type of Neural Network (NN) frequently used for image classification tasks, such as face recognition, and for any other problem where the input has a grid-like topology. In CNNs, not every node is connected to all nodes of the next layer; in other words, they are not fully connected NNs. This helps prevent overfitting issues that come up in fully connected NNs, not to mention extra-slow convergence that results from too many connections in a NN.

The concept of CNN relies on a mathematical operation known as convolution, which is very common in the field of Digital Signal Processing. Convolution is defined as a product of two functions – a third function – that expresses the amount of overlap between the first two functions. In the area of CNN, convolution is achieved by sliding a filter (a.k.a. kernel) through the image.

In face recognition, the convolution operation allows us to detect different features in the image. The different filters can detect the vertical and horizontal edges, texture, curves, and other image features. This is why one of the first layers in any CNN is a convolutional layer.

Another layer common in CNNs is the pooling layer. Pooling is used to reduce the size of the image representation, which translates to reduction in the number of parameters and, ultimately, the computational effort. The most common type of pooling is "max," which uses a sliding window - similar to the one in the convolution operations – to harvest, in every location, the maximum value from the group of cells being matched. At the end, it builds a new representation of the image from the harvested maximum values.

The most common CNN architectures typically start with a convolutional layer, followed by an activation layer, then a pooling layer, and end with a traditional fully connected network such as a Multi-Layer NN. This type of model where layers are put one after the other is known as Sequential. Why a fully connected network at the end? To learn a non-linear combination of features in the transformed image (after convolution and pooling).

Design a CNN

Here is the architecture we’ll implement in our CNN:

Input layer – a NumPy array (img_width, img_height, 1); "1" because we are dealing with grayscale images; for RGB images, it would have been (img_width, img_height, 3)
Conv2D layer – 32 filters, filter size of 3
Activation layer – must use a nonlinear function for learning, in this case, the function is ReLU
Conv2D layer – 32 filters, filter size of 3, stride of 3
Activation Layer using the ReLU function
MaxPooling2D layer – applies the (2, 2) pooling window
DropOut layer, at 25% – prevents overfitting by randomly dropping some of the values from the previous layer (setting them to 0); a.k.a. the dilution technique
Conv2D layer – 64 filters, filter size of 3
Activation layer using the ReLU function
Conv2D layer – 64 filters, filter size of 3 and, stride of 3
Activation layer using the ReLU function
MaxPooling2D layer – applies the (2, 2) pooling window
DropOut layer, at 25%
Flatten layer – transforms the data to be used in the next layer
Dense layer – represents a fully connected traditional NN
Activation layer using the ReLU function
DropOut layer, at 25%
Dense layer, with the number of nodes matching the number of classes in the problem – 15 for the Yale dataset
Activation layer using the ReLU function

The above architecture is pretty common; layer parameters had been fine-tuned experimentally.

Implement the CNN

Now let’s implement our CNN architecture – the set of layers we’ve selected – in code. To create an easy-to-extend solution, we’ll use the ML model with a set of abstract methods:

class MLModel(metaclass=abc.ABCMeta):

    def __init__(self, dataSet=None):
        if dataSet is not None:
            self.objects = dataSet.objects
            self.labels = dataSet.labels
            self.obj_validation = dataSet.obj_validation
            self.labels_validation = dataSet.labels_validation
            self.number_labels = dataSet.number_labels
            self.n_classes = dataSet.n_classes
        self.init_model()

    @abstractmethod
    def init_model(self):
        pass

    @abstractmethod
    def train(self):
        pass

    @abstractmethod
    def predict(self, object):
        pass

    @abstractmethod
    def evaluate(self):
        score = self.get_model().evaluate(self.obj_validation, self.labels_validation, verbose=0)
        print("%s: %.2f%%" % (self.get_model().metrics_names[1], score[1] * 100))

    @abstractmethod
    def get_model(self):
        pass

In our case, dataset is an instance of the FaceDataSet class described in the previous article of this series. The ConvolutionalModel class, which inherits from MLModel and implements all its abstract methods, is the one that will contain our CNN architecture. Here it is:

class ConvolutionalModel(MLModel):

    def __init__(self, dataSet=None):
        if dataSet is None:
            raise Exception("DataSet is required in this model")
        self.shape = numpy.array([constant.IMG_WIDTH, constant.IMG_HEIGHT, 1])
        super().__init__(dataSet)
        self.cnn.compile(loss=constant.LOSS_FUNCTION,
                         optimizer=Common.get_sgd_optimizer(),
                         metrics=[constant.METRIC_ACCURACY])

    def init_model(self):
        self.cnn = Sequential()

        self.cnn.add(Convolution2D(32, 3, padding=constant.PADDING_SAME, input_shape=self.shape))
        self.cnn.add(Activation(constant.RELU_ACTIVATION_FUNCTION))
        self.cnn.add(Convolution2D(32, 3, 3))
        self.cnn.add(Activation(constant.RELU_ACTIVATION_FUNCTION))
        self.cnn.add(MaxPooling2D(pool_size=(2, 2)))
        self.cnn.add(Dropout(constant.DROP_OUT_O_25))

        self.cnn.add(Convolution2D(64, 3, padding=constant.PADDING_SAME))
        self.cnn.add(Activation(constant.RELU_ACTIVATION_FUNCTION))
        self.cnn.add(Convolution2D(64, 3, 3))
        self.cnn.add(Activation(constant.RELU_ACTIVATION_FUNCTION))
        self.cnn.add(MaxPooling2D(pool_size=(2, 2)))
        self.cnn.add(Dropout(constant.DROP_OUT_O_25))

        self.cnn.add(Flatten())
        self.cnn.add(Dense(constant.NUMBER_FULLY_CONNECTED))
        self.cnn.add(Activation(constant.RELU_ACTIVATION_FUNCTION))
        self.cnn.add(Dropout(constant.DROP_OUT_0_50))
        self.cnn.add(Dense(self.n_classes))
        self.cnn.add(Activation(constant.SOFTMAX_ACTIVATION_FUNCTION))
        self.cnn.summary()

    def train(self, n_epochs=20, batch=32):
        self.cnn.fit(self.objects, self.labels,
                       batch_size=batch,
                       epochs=n_epochs, shuffle=True)

    def get_model(self):
        return self.cnn

    def predict(self, image):
        image = Common.to_float(image)
        result = self.cnn.predict(image)
        print(result)

    def evaluate(self):
       super(ConvolutionalModel, self).evaluate()

In the constructor, we set the self.shape variable, which defines the shape of the input layer. In our case, with the Yale dataset images 320 pixels tall and 243 pixels wide, self.shape=(320, 243, 1).

We then call super() to get all dataset-related variables set from the parent constructor, as well as to call the init_model() method that initializes the model.

Finally, we call the compile method, which configures the model for training and sets the objective function to use in the loss parameter. The objective function is optimized – minimized or maximized – during the training process. The accuracy parameter defines the metric to evaluate the model during training. The optimizer parameter defines the way the weights are calculated; the most common optimizer is Gradient Descent.

Our CNN model is defined as Sequential, with all layers added as the architecture requires. The train() method uses the fit method of the sequential class, representing an arrangement of layers, to train CNN. This method receives as input the data to train the CNN on, the correct classification of this data, and some optional parameters such as the number of epochs to run.

Train the CNN

Now the code is ready – time to train our CNN. Let’s instantiate the ConvolutionalModel class, train on the Yale dataset, and call the evaluate method.

cnn = ConvolutionalModel(dataSet)
cnn.train(n_epochs=50)
cnn.evaluate()

After running the training for 50 epochs, we got to the accuracy of almost 85% on the test images.

This means that our CNN will now recognize each one of the 15 subjects in the dataset with a probability of 85%. Not bad for a brief exercise, eh?

Now that we trained our CNN, if we would like to predict new incoming data, meaning new faces from images, we can do it using the predict(image) method from the ConvolutionalModel class detailed before. How would it work? The call would look like the next one and it should comply with certain assumptions.

cnn.predict(np.expand_dims(image, axis=0))

First, the input image needs to have the same dimensions or shape as the input layer of the CNN that was previously trained. Second, it should be the same type of input, i.e. a matrix of pixel values, inside the predict() method we normalize data so there is no need to provide a normalized image pixel matrix. Third, we may need to add a dimension to the inputted face image, since in the trained CNN we considered a 4th dimension for the number of samples in the dataset. This can be achieved using the expand_dims() method of numpy. Fourth, it is assumed that a face image will be provided, in case of a larger picture the face detection method provided in previous articles can prove useful.

Finally, the output of the predict() method can be seen in the previous figure. This method will output a probability of the face belonging to each of the possible classes or individuals (15 for the trained dataset). In this case we can see that the highest probability will be for class 4, which is precisely the class or person the inputted face image refers to.

Next Step?

Now we know how to build our own CNN from scratch. In the next article, we’ll investigate an alternative approach – utilizing a pre-trained model. We’ll take a CNN that had been previously trained for face recognition on a dataset with millions of images – and adapt it to solve our problem. Stay tuned!