Leverage deep learning using AWS Rekognition and surveillance cameras to analyze customer satisfaction and emotions from sales representatives' interactions
- Intro
- Prerequisites
- Aggregating faces and facial expressions (emotions) data from a video
- Extracting employees (sales representatives) from the general pool of faces
- Detecting whether a customer is interacting with an employee
- Calculate employee performance
Intro
Imagine a bustling retail store with customers browsing, interacting with products, and engaging with staff.
What if the store could gain insights into how customers feel during these interactions and identify the most effective sales representatives?
By leveraging surveillance cameras and AWS Rekognition, stores can analyze and gain insights into customer satisfaction and behavior.
Correctly analyzing customer emotions can provide valuable insights into which areas should be improved to increase both customer satisfaction and sales.
AWS Rekognition is a powerful deep learning tool from AWS that can analyze images and videos.
It can be used to detect and track faces, analyze facial expressions and facial attributes, detect objects and landmarks, and many other capabilities.
This is just one of many use cases for AWS Rekognition; it can also be used for security surveillance, content moderation, user engagement analysis, and many other applications.
I will use Python with Boto3, but AWS has libraries for nearly every major programming language.
A quick yet important note: this is not production-ready code.
It lacks error validation, scalability and efficiency considerations, testability, and other best practices that were omitted for simplicity’s sake.
Additionally, the algorithm is simplified since the main goal is to demonstrate AWS Rekognition’s capabilities and not to provide a production-ready solution.
Think of it as hackathon-level code and algorithms.
Prerequisites
- Make sure you have the `boto3` and `cv2` Python libraries installed.
- You need an AWS account and an IAM user with the necessary permissions to use Rekognition and S3.
- You need an S3 bucket to store the video file from the camera. Although it is possible to create real-time analysis with real-time streaming, for simplicity, we will use a simple video file.
- Upload the video file you want to analyze to the S3 bucket.
- A local folder with pictures (e.g., employee tags) of the sales representatives that will be used to identify them in the video. For simplicity, the file name will indicate the employee's name.
Aggregating faces and facial expressions (emotions) data from a video
The first step is to analyze the entire video and extract all the faces and facial expressions from it. This data will serve as the basis for further analysis and insights. To achieve that, we will follow these steps:
- Create a face detection job and collect the results
- Aggregate the detected faces for a given timestamp
Detecting faces and collecting results
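The first part of this step is to start the asynchronous face detection job on the video in the S3 bucket. Here is a minimal sketch; the client setup, bucket, and file names below are placeholders, and `FaceAttributes="ALL"` is requested so that the pose, emotion, and mouth attributes used later are returned:

```python
import boto3

rekognition = boto3.client("rekognition")

# Placeholder bucket and key; replace with your own S3 location.
VIDEO_S3_BUCKET = "my-surveillance-bucket"
VIDEO_S3_KEY = "store-floor.mp4"


def start_face_detection_job():
    # Start an asynchronous face detection job on the video stored in S3.
    # FaceAttributes="ALL" is required to get Pose, Emotions, MouthOpen, etc.
    response = rekognition.start_face_detection(
        Video={"S3Object": {"Bucket": VIDEO_S3_BUCKET, "Name": VIDEO_S3_KEY}},
        FaceAttributes="ALL",
    )
    return response["JobId"]
```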
Now, we need to monitor the job and collect the output.
In a real-world scenario, you would probably use a more refined method than the one in this example.
For instance, the Rekognition API allows requesting notifications via the `NotificationChannel` parameter, so you can set an SNS topic to be notified when the job is completed.
This can trigger a Lambda function to process the results.
In our case, for simplicity, we will just create a function to poll the job status every 10 seconds until it is completed (or failed).
After the job is done, the result might be paginated.
In this case, you'll receive a `NextToken` in the response, which you'll need to use to page through the results.
Let’s create a function for that:
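Below is a minimal sketch covering both the 10-second polling loop and the `NextToken` pagination, reusing the `rekognition` client and the hypothetical `start_face_detection_job` helper from above:

```python
import time


def wait_for_face_detection(job_id):
    # Poll the job status every 10 seconds until it succeeds or fails.
    while True:
        response = rekognition.get_face_detection(JobId=job_id)
        if response["JobStatus"] in ("SUCCEEDED", "FAILED"):
            return response["JobStatus"]
        time.sleep(10)


def collect_detected_faces(job_id):
    # Page through the results with NextToken and collect all detected faces.
    faces = []
    next_token = None
    while True:
        params = {"JobId": job_id}
        if next_token:
            params["NextToken"] = next_token
        response = rekognition.get_face_detection(**params)
        faces.extend(response["Faces"])
        next_token = response.get("NextToken")
        if not next_token:
            return faces


job_id = start_face_detection_job()
if wait_for_face_detection(job_id) == "SUCCEEDED":
    faces = collect_detected_faces(job_id)
```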
`faces` will contain an array of detected faces in a video frame, with the timestamp of the frame:
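An abridged example of a single entry (the values here are illustrative):

```python
# Abridged example of a single item in `faces` (values are illustrative).
{
    "Timestamp": 12000,  # milliseconds from the start of the video
    "Face": {
        "BoundingBox": {"Width": 0.11, "Height": 0.25, "Left": 0.32, "Top": 0.41},
        "Pose": {"Roll": 2.1, "Yaw": -12.5, "Pitch": 4.3},
        "Emotions": [
            {"Type": "HAPPY", "Confidence": 93.4},
            {"Type": "CALM", "Confidence": 4.1},
        ],
        "MouthOpen": {"Value": True, "Confidence": 88.2},
        "Confidence": 99.9,
    },
}
```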
At this point, let’s ignore what each property means; we will discuss it later.
Aggregating detected faces for a given timestamp
Now, let’s aggregate the face objects per timestamp.
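A minimal sketch of the aggregation, using a `defaultdict` keyed by timestamp (the original implementation may differ in details):

```python
from collections import defaultdict


def aggregate_faces_by_timestamp(faces):
    # Group the detected face objects by the frame timestamp they belong to.
    aggregated = defaultdict(list)
    for item in faces:
        aggregated[item["Timestamp"]].append(item["Face"])
    return aggregated


aggregated_faces = aggregate_faces_by_timestamp(faces)
```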
Now `aggregated_faces` will contain an array of face objects per timestamp.
Extracting employees (sales representatives) from the general pool of faces
Before we can analyze customer satisfaction related to the sales representatives, we need a way to differentiate between the sales representatives and the customers. Here is the process to do this:
- Create a collection in Rekognition.
- Index our employee pictures to the collection.
- Run a face search on this collection to identify the employees in the video.
- Tag employees in the generic faces (`aggregated_faces`) collection object.
Creating a collection and indexing employee pictures
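Here is a minimal sketch of both steps; the collection name and the pictures folder are placeholder values, and the file name (without its extension) is used as the `ExternalImageId`, i.e., the employee's name:

```python
import os

COLLECTION_ID = "employees"             # hypothetical collection name
EMPLOYEE_PICTURES_DIR = "./employees"   # local folder with one picture per employee


def create_employee_collection():
    # Create the Rekognition collection that will hold the employee faces.
    rekognition.create_collection(CollectionId=COLLECTION_ID)


def index_employee_pictures():
    # Index each picture into the collection. The file name (without its
    # extension) is used as the ExternalImageId, i.e., the employee's name.
    for file_name in os.listdir(EMPLOYEE_PICTURES_DIR):
        employee_name = os.path.splitext(file_name)[0]
        with open(os.path.join(EMPLOYEE_PICTURES_DIR, file_name), "rb") as image:
            rekognition.index_faces(
                CollectionId=COLLECTION_ID,
                Image={"Bytes": image.read()},
                ExternalImageId=employee_name,
                MaxFaces=1,
            )


create_employee_collection()
index_employee_pictures()
```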
Detecting employees in the video
Now, let's run a face search on the video in the S3 bucket to identify the employees in the collection.
The process is similar to the previous face detection we created.
- Create a job
- Monitor the status
- Once done, paginate through the results
Let’s start by defining a function for the employees’ face search:
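A minimal sketch, reusing the client, bucket, and collection defined above (the `FaceMatchThreshold` value is an arbitrary choice):

```python
def start_employee_search_job():
    # Search the video for faces that match the employee collection.
    response = rekognition.start_face_search(
        Video={"S3Object": {"Bucket": VIDEO_S3_BUCKET, "Name": VIDEO_S3_KEY}},
        CollectionId=COLLECTION_ID,
        FaceMatchThreshold=90,
    )
    return response["JobId"]
```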
Now, let’s monitor the job status similar to what we previously did:
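Again, a simple 10-second polling loop is enough for this example, mirroring the earlier helper:

```python
def wait_for_face_search(job_id):
    # Poll the face search job every 10 seconds until it completes or fails.
    while True:
        response = rekognition.get_face_search(JobId=job_id)
        if response["JobStatus"] in ("SUCCEEDED", "FAILED"):
            return response["JobStatus"]
        time.sleep(10)
```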
And finally, iterate through the pages of the result with `NextToken`:
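A sketch of the pagination, collecting every matched person from the `Persons` array of the response:

```python
def collect_employee_matches(job_id):
    # Page through the face search results and collect every matched person.
    persons = []
    next_token = None
    while True:
        params = {"JobId": job_id}
        if next_token:
            params["NextToken"] = next_token
        response = rekognition.get_face_search(**params)
        persons.extend(response["Persons"])
        next_token = response.get("NextToken")
        if not next_token:
            return persons


search_job_id = start_employee_search_job()
if wait_for_face_search(search_job_id) == "SUCCEEDED":
    employees = collect_employee_matches(search_job_id)
```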
Now, the `employees` variable will contain an array of `Person` objects:
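An abridged example of a single entry (values are illustrative; `ExternalImageId` is the employee name used when indexing the pictures):

```python
# Abridged example of a single item in `employees` (values are illustrative).
{
    "Timestamp": 12000,
    "Person": {
        "Index": 0,
        "BoundingBox": {"Width": 0.12, "Height": 0.34, "Left": 0.31, "Top": 0.38},
        "Face": {
            "BoundingBox": {"Width": 0.11, "Height": 0.25, "Left": 0.32, "Top": 0.41},
            "Confidence": 99.8,
        },
    },
    "FaceMatches": [
        {
            "Similarity": 98.7,
            "Face": {"FaceId": "...", "ExternalImageId": "john_doe"},
        }
    ],
}
```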
Aggregating employees per timestamp
Now, similar to what we did with all the faces, let’s aggregate the employees per timestamp.
As a quick reminder, in production code you would likely create a single, more generic function, since although the APIs differ between the two jobs, the logic is the same.
However, for the sake of simplicity, let's keep them separate.
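A minimal sketch of that aggregation, keeping only the employee name (from the `ExternalImageId` of the best match) and the face bounding box:

```python
def aggregate_employees_by_timestamp(employees):
    # Group the matched employee faces by frame timestamp, keeping the employee
    # name (the ExternalImageId of the best match) and the face bounding box.
    aggregated = defaultdict(list)
    for item in employees:
        matches = item.get("FaceMatches") or []
        if not matches:
            continue
        aggregated[item["Timestamp"]].append(
            {
                "Name": matches[0]["Face"]["ExternalImageId"],
                "BoundingBox": item["Person"]["Face"]["BoundingBox"],
            }
        )
    return aggregated


aggregated_employee_faces = aggregate_employees_by_timestamp(employees)
```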
Now `aggregated_employee_faces` will contain an array of employee face objects per timestamp.
Tagging employees in the generic faces pool
Now that we have the employee faces, we can tag them in the generic faces pool by correlating bounding boxes between identical timestamps.
But before writing the code, let’s understand what a bounding box is.
Bounding boxes
In computer vision, a bounding box is a rectangle representing an object’s location and size in an image or a video.
In most cases, bounding boxes are represented either by the top left (x1, y1) and bottom right (x2, y2) corners, or by the top and left positions alongside the width and height of the box.
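Rekognition uses the second representation, with all values expressed as ratios of the overall frame dimensions, for example:

```python
# Rekognition's BoundingBox: the top-left corner plus width and height,
# all expressed as ratios of the overall frame size (0.0 - 1.0).
{"Left": 0.32, "Top": 0.41, "Width": 0.11, "Height": 0.25}
```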
Bounding box comparison
Unfortunately, at the time of writing this post, unlike get_face_search, get_face_detection does not return a face ID.
Thus, our only way to tag known faces from the generic faces pool, which also includes customers, is by comparing bounding boxes in each time frame.
To tag the employees in the generic faces pool, we need to iterate through the generic faces and compare the bounding boxes with the employee faces.
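Here is one simple way to do that correlation; the tolerance value is arbitrary, and the original comparison logic may differ:

```python
def boxes_match(box_a, box_b, tolerance=0.02):
    # Treat two bounding boxes as the same face if every coordinate is within
    # a small tolerance (all values are ratios of the frame dimensions).
    return all(
        abs(box_a[key] - box_b[key]) <= tolerance
        for key in ("Left", "Top", "Width", "Height")
    )


def tag_employees(aggregated_faces, aggregated_employee_faces):
    # For every timestamp, mark the generic faces whose bounding box matches a
    # known employee face by adding an "employee" key with the employee name.
    for timestamp, faces_in_frame in aggregated_faces.items():
        for employee in aggregated_employee_faces.get(timestamp, []):
            for face in faces_in_frame:
                if boxes_match(face["BoundingBox"], employee["BoundingBox"]):
                    face["employee"] = employee["Name"]


tag_employees(aggregated_faces, aggregated_employee_faces)
```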
At this point, the `aggregated_faces` object will contain the `employee` key for each face that was detected as an employee.
Now, we can move on to implementing an algorithm to detect whether a customer is interacting with an employee, and if so, whether the customer is happy.
Detecting whether a customer is interacting with an employee
In order to determine whether a customer is interacting with an employee, multiple conditions should be met:
- The customer and the employee are in the same time frame
- The customer and the employee are in proximity in 2D
- The customer and the employee are in proximity in 3D (depth)
- The customer is facing the employee to a certain degree
- The customer or the employee is talking
Determine 2D proximity - Euclidean distance
One of the indicators of proximity in 2D is the distance between the customer and the employee based on the bounding boxes.
To determine proximity, we can calculate the center of each bounding box and measure the distance between the two centers using the Euclidean distance formula.
In order to determine the `x` and `y` coordinates of the center of a bounding box, the following formulas can be used:
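In terms of Rekognition's `Left`, `Top`, `Width`, and `Height` ratios:

$$x_{center} = Left + \frac{Width}{2}, \qquad y_{center} = Top + \frac{Height}{2}$$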
After calculating the center of each bounding box, we need to calculate their distance.
If we think of the image as a 2D plane, the distance between two points can be calculated using the Euclidean distance formula:
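With the two centers (x1, y1) and (x2, y2):

$$distance = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$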
Let’s implement this in code:
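Below is a sketch of the distance check together with a first version of the main iteration that builds `employee_interactions`; the distance threshold and the object shape are illustrative choices, not necessarily the original ones:

```python
import math


def bounding_box_center(box):
    # Center of a bounding box, in frame-ratio coordinates.
    return (box["Left"] + box["Width"] / 2, box["Top"] + box["Height"] / 2)


def are_in_2d_proximity(box_a, box_b, max_distance=0.2):
    # Euclidean distance between the two centers; the threshold is an
    # arbitrary illustrative value in frame-ratio units.
    (x1, y1), (x2, y2) = bounding_box_center(box_a), bounding_box_center(box_b)
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2) <= max_distance


def are_interacting(customer, employee):
    # First version: only the 2D proximity condition. The remaining
    # conditions are added in the following sections.
    return are_in_2d_proximity(customer["BoundingBox"], employee["BoundingBox"])


# Main iteration: faces sharing a timestamp satisfy the "same time frame"
# condition; pair every employee with every customer in that frame and record
# the customer's emotions whenever the two appear to be interacting.
employee_interactions = defaultdict(list)
for timestamp, faces_in_frame in aggregated_faces.items():
    employees_in_frame = [f for f in faces_in_frame if "employee" in f]
    customers_in_frame = [f for f in faces_in_frame if "employee" not in f]
    for employee in employees_in_frame:
        for customer in customers_in_frame:
            if are_interacting(customer, employee):
                employee_interactions[employee["employee"]].append(
                    {"Timestamp": timestamp, "CustomerEmotions": customer["Emotions"]}
                )
```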
We will soon add additional conditions to determine whether the employee and the customer were interacting; however, for now, let's clarify that the output of the main iteration will be an `employee_interactions` object with contents that look like this:
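(Following the illustrative shape used in the sketch above; the exact keys in the original code may differ.)

```python
employee_interactions = {
    "john_doe": [
        {
            "Timestamp": 12000,
            "CustomerEmotions": [
                {"Type": "HAPPY", "Confidence": 93.4},
                {"Type": "CALM", "Confidence": 4.1},
            ],
        },
        # ... more interactions with this employee
    ],
    # ... more employees
}
```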
Based on this object, we can examine and improve employee interactions. However, before that, let’s add additional conditions to more precisely determine whether the employee and the customer are interacting.
Yaw and Pitch degrees to determine whether the customer is facing the employee
Of course there is no guarantee that two people facing each other are interacting, nor does it mean that if they are not facing each other they are not interacting. However, it’s a reasonable indicator and we will use it to increase our confidence in the interaction detection.
To determine whether the customer is facing the employee, we can use the `Pose` object returned by the `get_face_detection` API, or more precisely the `Yaw` and `Pitch` values, which are specified in degrees and range from -180 to 180.
The API also returns a `Roll` value, but we will not use it in this case.
To demonstrate the concept, I’ve combined and edited nice illustrations by Tsang Ing Ren and Yu Yu from ResearchGate.
Now that we understand the concept of Yaw and Pitch, we can use it to determine whether the customer is facing the employee:
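Here is a naive heuristic sketch; the angle threshold and the geometric logic are illustrative, not a definitive way to decide that two people face each other:

```python
def are_facing_each_other(pose_a, pose_b, max_angle=30):
    # Naive heuristic: two faces turned toward each other tend to have Yaw
    # angles of roughly opposite sign (so their sum is close to zero), and
    # neither should be tilted too far up or down. The threshold is illustrative.
    facing_yaw = abs(pose_a["Yaw"] + pose_b["Yaw"]) <= max_angle
    level_pitch = abs(pose_a["Pitch"]) <= max_angle and abs(pose_b["Pitch"]) <= max_angle
    return facing_yaw and level_pitch
```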
After that, we will add `are_facing_each_other` as part of the `are_interacting` function, which is completed below.
Determine 3D proximity - BoundingBox size
Alright, so now we’ve determined the proximity of the bounding boxes in 2D and whether the customer is facing the employee. Next, we need to determine whether the bounding boxes are close enough in 3D space. For example, the following bounding boxes are close in 2D, but not in 3D:
To determine the 3D proximity, we can use the size of the bounding boxes and simply compare the width and the height of the box:
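A simple sketch of that comparison; the maximum size ratio is an arbitrary illustrative threshold:

```python
def are_in_3d_proximity(box_a, box_b, max_size_ratio=1.5):
    # If one face's bounding box is much larger than the other's, the two
    # people are probably at very different distances from the camera.
    width_ratio = max(box_a["Width"], box_b["Width"]) / min(box_a["Width"], box_b["Width"])
    height_ratio = max(box_a["Height"], box_b["Height"]) / min(box_a["Height"], box_b["Height"])
    return width_ratio <= max_size_ratio and height_ratio <= max_size_ratio
```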
After that, we will add `are_in_3d_proximity` as part of the `are_interacting` function, which is completed below.
Indication of talking - MouthOpen
Like the other features, an open mouth does not guarantee that the person is talking, but it’s an additional indicator of possible interaction.
When combined with the other features we checked, it can increase our confidence that there is an interaction between the employee and the customer.
Alongside `BoundingBox`, `Pose`, and `Emotions`, the `get_face_detection` API also returns a `MouthOpen` value, which indicates whether the mouth is open or not.
Let’s write a short code snippet to add this to our conditions:
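A short sketch; the confidence threshold is arbitrary, and an open mouth is only treated as a weak signal:

```python
def is_anyone_talking(face_a, face_b, min_confidence=80):
    # Treat an open mouth (detected with reasonable confidence) on either
    # side as a weak indicator that one of the two people is talking.
    def mouth_open(face):
        mouth = face.get("MouthOpen", {})
        return mouth.get("Value", False) and mouth.get("Confidence", 0) >= min_confidence

    return mouth_open(face_a) or mouth_open(face_b)
```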
Completing the are_interacting function
Now that we have all the conditions, we can complete the `are_interacting` function:
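A sketch of the combined function, assuming both arguments are face objects from `aggregated_faces` (so each has `BoundingBox`, `Pose`, and `MouthOpen`):

```python
def are_interacting(customer, employee):
    # Combine all the heuristics above; every threshold is illustrative.
    # The "same time frame" condition is handled by the main iteration,
    # which only pairs faces that share a timestamp.
    return (
        are_in_2d_proximity(customer["BoundingBox"], employee["BoundingBox"])
        and are_in_3d_proximity(customer["BoundingBox"], employee["BoundingBox"])
        and are_facing_each_other(customer["Pose"], employee["Pose"])
        and is_anyone_talking(customer, employee)
    )
```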
Calculate employee performance
As explained in the previous sections, the `employee_interactions` object should now contain the list of interactions and the customer emotions for each interaction, roughly in the shape shown earlier.
Now we can calculate the final performance score for each employee (explanations are in the code comments):
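The original scoring logic is not reproduced here; the sketch below shows one plausible approach, averaging the confidence of positive customer emotions (HAPPY, CALM) across each employee's interactions:

```python
def calculate_employee_scores(employee_interactions):
    # One plausible approach: the score is the average confidence of
    # "positive" customer emotions (HAPPY, CALM) across all of an employee's
    # interactions; the interaction count is kept as supporting data.
    positive_emotions = {"HAPPY", "CALM"}
    scores = {}
    for employee, interactions in employee_interactions.items():
        total_positive_confidence = 0
        for interaction in interactions:
            total_positive_confidence += sum(
                emotion["Confidence"]
                for emotion in interaction["CustomerEmotions"]
                if emotion["Type"] in positive_emotions
            )
        scores[employee] = {
            "score": round(total_positive_confidence / len(interactions), 2)
            if interactions
            else 0,
            "interactions": len(interactions),
        }
    return scores


employee_scores = calculate_employee_scores(employee_interactions)
```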
The output of `employee_scores` will be a dictionary with the employee names as keys, and the performance score and supporting data as values:
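For example (illustrative values and hypothetical employee names, matching the scoring sketch above):

```python
{
    "john_doe": {"score": 87.3, "interactions": 14},
    "jane_smith": {"score": 91.6, "interactions": 9},
}
```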
Now, one can use this information as actionable insights to improve employee performance and customer satisfaction.
As clarified at the start of the post, this is not production-ready code: it is too simplified and lacks error handling, scaling considerations, and other important aspects of production code.
However, this post gives a good starting point for using AWS Rekognition.