Face Recognition Using OpenCV and Deep Learning

Kumar Sanu
6 min read · Oct 19, 2020


Computer vision is one of the areas that has been advancing rapidly, and much of that progress is due to deep learning. Face recognition is one of the most interesting applications in the field of computer vision. Having gone through many articles, I have noticed that terms like face detection, face verification, and face recognition are inter-related and often confuse beginners.

So through this article, I would like to explain all these terms in a very simple way, and later on we will see how to implement our own face recognition system that recognizes faces in images as well as in a live video stream.

Face Detection
In simple words, face detection means identifying human faces in digital images.

Fig. Face Detection in image

As we can see in the picture above, faces are detected and a yellow box is drawn around each face.

Face Verification
In the face verification problem, you are given an input image along with a name or ID, and the job of the system is to verify whether or not the input image is that of the claimed person. This is sometimes called a one-to-one problem: you just want to know if the person is who they claim to be. For example, a mobile phone that unlocks using your face is doing face verification. It is a 1:1 matching problem because we are verifying whether the input image belongs to the person whose ID or name was provided along with it.
Face verification: “Is this the claimed person?”

Face Recognition
In the face recognition problem, you are given only an input image, and the job of the system is to identify which person the image belongs to. For example, an attendance system that recognizes employees entering the office without them needing to otherwise identify themselves. This is a 1:K matching problem: given one input image, the system tries to identify which of K persons it belongs to, or whether it belongs to none of them.
Face recognition: “Who is this person?”

One-Shot Learning
One of the challenges of face recognition is that we need to solve the one-shot learning problem. Let’s understand this through an example. Suppose we have a database of ten pictures of 10 different employees in an organization, one of whom is Tom. Despite having seen only one image of Tom, the system has to recognize that a new picture is actually the same person. In contrast, if it sees someone who is not in the database, it should recognize that this is none of the ten persons. So our system has to learn from just one example to recognize the person again. The major advantage of a one-shot learning algorithm is that we don’t need to retrain our convnet every time a new employee is added to or removed from the database.

To carry out one-shot learning, we want a neural network to learn a similarity function d, which takes two images as input and outputs the degree of difference between them. If the two images are of the same person, we want d to output a small number; if they are of two very different people, we want it to output a large number. At recognition time, if the degree of difference is less than some threshold called tau (a hyperparameter), we predict that the two pictures are of the same person; if it is greater than tau, we predict that they are different persons. For the face recognition task, when the system gets a new image, it uses d to compare it with every image stored in the database and returns the minimum degree of difference; on this basis we can tell which person the image belongs to, or that it belongs to none of them. And if someone new joins our team, we can just add that person to the database and it works fine.
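As a minimal sketch (my own illustration, not code from the repo), assuming each image has already been encoded as a 128-dimensional vector and d is the Euclidean distance between encodings:

import numpy as np

def d(enc1, enc2):
    # degree of difference between two 128-d face encodings
    return np.linalg.norm(enc1 - enc2)

tau = 0.7                    # hypothetical threshold value (a hyperparameter)
enc_a = np.random.rand(128)  # placeholder encodings, just for illustration
enc_b = np.random.rand(128)
print("same person" if d(enc_a, enc_b) < tau else "different persons")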

Triplet Loss
To train a neural network to learn the function d, we use the triplet loss function. Here d outputs the difference between the encodings of two images. The idea is to always look at one anchor image: we want the distance between the anchor and a positive example (an image of the same person) to be small, whereas the distance between the anchor and a negative example (an image of a different person) should be much larger.

This loss function operates on three inputs:

  1. Anchor (A): an arbitrary data point
  2. Positive (P): an image of the same class as the anchor
  3. Negative (N): an image of a different class from the anchor

Mathematically, it can be defined as: L = max(d(A,P) − d(A,N) + margin, 0).

The neural network then uses gradient descent to minimize this loss, which pushes d(A,P) towards 0 and pushes d(A,N) to be greater than d(A,P) + margin. In other words, the network learns to output encodings for the positive image that are very similar to the anchor’s, and encodings for the negative image that are very different from the anchor’s.
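As a rough NumPy sketch of the loss for a single triplet (squared Euclidean distance is one common choice for d; the margin value here is just an example):

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # d(A, P): squared distance between anchor and positive encodings
    pos_dist = np.sum(np.square(anchor - positive))
    # d(A, N): squared distance between anchor and negative encodings
    neg_dist = np.sum(np.square(anchor - negative))
    # hinge at zero: loss vanishes once d(A, N) exceeds d(A, P) + margin
    return max(pos_dist - neg_dist + margin, 0.0)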

Implementation
Now that we are familiar with the terminology used in face recognition and have seen how the neural network is trained, we can move on to implementing our own face recognition system. The whole code implementation, along with explanations, can be found on my github repo.

For face detection, we are using an OpenCV Haar cascade. The cascades themselves are just a bunch of XML files that contain the data OpenCV uses to detect objects. OpenCV comes with a number of built-in cascades for detecting everything from faces to eyes to hands to legs. We will be using the frontal face cascade classifier in our case. The cascades are readily available on the official github repo of OpenCV.

import cv2

cascPath = "haarcascade_frontalface_default.xml"
# Create the haar cascade
faceCascade = cv2.CascadeClassifier(cascPath)

# Load the image (path is a placeholder) and convert it to grayscale
image = cv2.imread("image.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces in the image
faces = faceCascade.detectMultiScale(
    gray,
    scaleFactor=1.1,
    minNeighbors=5,
    minSize=(30, 30),
    flags=cv2.CASCADE_SCALE_IMAGE,
)
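To visualize the detections, we can draw a box around each face (a small follow-up to the block above; note that OpenCV colors are BGR, so (0, 255, 255) is yellow):

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 255), 2)
cv2.imwrite("detected.jpg", image)  # output path is a placeholder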

Note:- Since these OpenCV classifiers are based on a machine learning algorithm, we need to experiment with different values for the window size, scale factor, and so on until we find the combination that works best for our use case.

For face recognition, we crop each face from the image using the face location obtained through the Haar cascade detector, resize it to 96x96x3, and feed it to the FaceNet deep learning model, which has already been trained on millions of images using the triplet loss function defined above. FaceNet takes a face image (or a batch of m face images) as a tensor of shape (m, nC, nH, nW) = (m, 3, 96, 96) and outputs a matrix of shape (m, 128) that encodes each input face image as a 128-dimensional vector.
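As a sketch of this preprocessing step (a minimal version, assuming model is the pretrained Keras FaceNet expecting channels-first (3, 96, 96) inputs; the exact preprocessing may differ in the repo):

import cv2
import numpy as np

def img_to_encoding(face_bgr, model):
    face = cv2.resize(face_bgr, (96, 96))  # resize the crop to 96x96x3
    img = face[..., ::-1] / 255.0          # BGR -> RGB, scale to [0, 1]
    x = np.transpose(img, (2, 0, 1))       # HWC -> CHW, i.e. (3, 96, 96)
    x = np.expand_dims(x, axis=0)          # add batch dim -> (1, 3, 96, 96)
    return model.predict(x)                # encoding of shape (1, 128)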

So the project implementation is mainly done as follows:-
1. Face Encodings:-
Feed images of each person to the face detector mentioned above, then use the FaceNet model to get an encoding of each face, and finally store all the face encodings in a database.
2. Face Recognition in Images:-
Feed the input image to the face detector to get the faces in the image, then pass those faces to the FaceNet model to get an encoding of each one. We then compute the degree of difference between each face encoding and the encodings stored in the database, and return the name (label) of the person with the minimum degree of difference, provided this minimum distance is less than the predefined threshold tau; otherwise we return unknown (not found in the database). Finally, we put boxes around the faces in the image along with the labels obtained (a code sketch follows after this list).

Output of the face recognition in an image:-

Fig. Trip to Lonavala with my friends

And we can see that it recognizes the faces correctly in the above image.

3. Live Face Recognition:-
For live face recognition, everything is the same as recognizing faces in images; the only difference is that we take frames from the live video as input through OpenCV and feed them to the face detector, rather than images stored on disk. A minimal sketch of all three steps follows.
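Here is a minimal end-to-end sketch of the three steps above, building on the img_to_encoding sketch from earlier (the helper names, the tau value, and the name/path list are my own placeholders, not the exact repo code; loading the pretrained FaceNet stands behind model and is omitted):

import cv2
import numpy as np

faceCascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

def detect_faces(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return faceCascade.detectMultiScale(gray, scaleFactor=1.1,
                                        minNeighbors=5, minSize=(30, 30))

# Step 1: build the database of known encodings (one image per person)
database = {}
for name, path in [("tom", "images/tom.jpg")]:  # placeholder name/path list
    img = cv2.imread(path)
    (x, y, w, h) = detect_faces(img)[0]
    database[name] = img_to_encoding(img[y:y + h, x:x + w], model)

# Step 2: match a new encoding against the database
def recognize(encoding, database, tau=0.7):
    min_dist, identity = float("inf"), "unknown"
    for name, db_enc in database.items():
        dist = np.linalg.norm(encoding - db_enc)  # degree of difference d
        if dist < min_dist:
            min_dist, identity = dist, name
    # reject the match if even the closest stored encoding is too far away
    return identity if min_dist < tau else "unknown"

# Step 3: live recognition, reading frames from the webcam instead of disk
video = cv2.VideoCapture(0)
while True:
    ret, frame = video.read()
    if not ret:
        break
    for (x, y, w, h) in detect_faces(frame):
        label = recognize(img_to_encoding(frame[y:y + h, x:x + w], model),
                          database)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 255), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
    cv2.imshow("Face Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
video.release()
cv2.destroyAllWindows()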

Note: For a better understanding of the code and implementation, kindly visit my github repo.

I hope this helps!!
Thanks for reading. As this is my first article here, any feedback or suggestions will be highly appreciated.
