Human Pose Comparison and Action Scoring using Deep Learning, OpenCV & Python

Pose estimation is one of the more elegant applications of neural networks. It is startlingly accurate and sometimes seems like something right out of science fiction.
For instance, check out Google's Move Mirror, an in-browser application that estimates the user's pose in real time and then displays a movie still with the actor holding the same pose.
When I looked it over, however, I got an idea. What if the same methodology could be used to compare the same action performed by two people? This technology could then be used to teach people remotely! I got to work immediately and tried to reverse engineer the techniques used by Google.
The Challenge
To my disappointment, I found only a few TensorFlow.js tutorials and write-ups on the subject, and nothing in Python. This was understandable, as Move Mirror is an in-browser application, so I decided to port the code to mighty Python. A few sessions of research showed me that they were using PoseNet, a fast yet accurate model, for estimating pose.
The objective was simple: I wanted to go one step further and compare a whole action, such as a punch or a kick, with an image and report the extent to which it was performed correctly.
The Model
As I stated earlier, Move Mirror uses PoseNet, a deep learning model that locates 17 key-points on the human body. I found a good Python implementation of it here.
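
For reference, the 17 key-points can be written down as a simple lookup list. This is only a convenience sketch: the names follow the part names used by the TensorFlow.js PoseNet release, and the helper `keypoint_index` is a hypothetical function added here for illustration, not part of any PoseNet port.

```python
# The 17 body parts PoseNet reports, in the model's output order.
POSENET_KEYPOINTS = [
    "nose", "leftEye", "rightEye", "leftEar", "rightEar",
    "leftShoulder", "rightShoulder", "leftElbow", "rightElbow",
    "leftWrist", "rightWrist", "leftHip", "rightHip",
    "leftKnee", "rightKnee", "leftAnkle", "rightAnkle",
]


def keypoint_index(name):
    """Hypothetical helper: position of a body part in the key-point list."""
    return POSENET_KEYPOINTS.index(name)
```

Each pose is then an array of 17 (x, y) coordinates, one row per entry in this list.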

Defining Similarity
The next challenge was defining similarity. When we think about the problem, we see that there are many uncertainties to be addressed: people have different heights and body shapes, and they may appear in different parts of the picture. One person may be standing close to the camera while another is far away. All these problems have to be solved in order to output a correct result.
Key Solutions:
- New Bounding Box: From the model output, we get the co-ordinates of the 17 key-points on a person's body. This information can be used to create a new bounding box which tightly covers the person in the picture.
- Normalization: In order to account for the size inconsistencies, we perform L2 normalization of the points, which transforms the pose into a unit vector.
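
The two steps above can be sketched in a few lines of NumPy. This is a minimal sketch under some assumptions of mine: the function name `normalize_pose` is hypothetical, the input is a (17, 2) array of (x, y) pixel coordinates, and the bounding box is scaled by its longer side so that the person's proportions are preserved.

```python
import numpy as np


def normalize_pose(keypoints):
    """Turn (17, 2) pixel coordinates into a position- and scale-free unit vector."""
    pts = np.asarray(keypoints, dtype=float)

    # Tight bounding box around the person: top-left corner and side lengths.
    top_left = pts.min(axis=0)
    size = pts.max(axis=0) - top_left

    # Move the box to the origin and divide by its longer side
    # (assumption: one common scale for x and y, to keep the aspect ratio).
    scale = max(size.max(), 1e-8)
    boxed = (pts - top_left) / scale

    # Flatten to a 34-dimensional vector and L2-normalize it to unit length.
    vec = boxed.flatten()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

With this, the same pose captured close to the camera and far away, or in a different corner of the frame, maps to the same vector.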


Cosine Similarity
Now that we have standardized the pose vectors, it is time to choose a similarity measure. I chose cosine similarity for this particular instance, mainly because we are working with vectors.
The 17 key-points are flattened into a single vector in high-dimensional space. This vector is compared with the corresponding vector from our benchmark image. The direction of the vectors indicates the similarity of the poses: the smaller the angle between them, the more similar the poses.
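
As a rough illustration, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The function name `cosine_similarity` below is my own, and returning 0.0 for an all-zero vector is an assumption made to avoid a division by zero.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two pose vectors, in the range [-1, 1]."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # Assumption: treat an empty or all-zero pose as "no similarity".
        return 0.0
    return float(np.dot(a, b) / denom)
```

A value of 1 means the two vectors point in the same direction, and the value drops as the poses diverge.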


Drawbacks of Initial Approach
- The algorithm does not take into account the time taken to perform the activity
- There is no scope for attaining a 100 percent score, since the average is taken against a single picture
Dynamic Time Warping (DTW)
Even though it sounds like a sci-fi method of time travel, it really isn't. It is simply a method for comparing sequences of different lengths, such as time-series graphs. It matches the troughs and crests of two graphs, allowing one-to-many matches, so that the frames end up synced automatically.

This method seemed ideal for my use case, as each of the 17 key-points traces out its own graph over time as the action is performed. I could then use DTW to compare these graphs with those of the benchmark and get one score for each of the 17 key-points. The average of these 17 scores could then be taken as the total score.
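
To make the idea concrete, here is a minimal sketch of the classic dynamic-programming form of DTW, applied per key-point. It is not necessarily the implementation I used: the names `dtw_distance` and `action_score` are hypothetical, frames are assumed to be arrays of shape (num_frames, 17, 2), and the result is a distance (lower means more similar) rather than a percentage. How to map it to a final score is left open.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat 1-D input as a sequence of scalars.
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]

    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # A frame may match one or many frames of the other sequence.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def action_score(user_frames, benchmark_frames):
    """Mean DTW distance over the key-points, plus the per-key-point values."""
    user = np.asarray(user_frames, dtype=float)
    bench = np.asarray(benchmark_frames, dtype=float)
    per_keypoint = [
        dtw_distance(user[:, k], bench[:, k]) for k in range(user.shape[1])
    ]
    return float(np.mean(per_keypoint)), per_keypoint
```

Because DTW aligns the sequences itself, a user who performs the same action at half the speed still lines up with the benchmark.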
Future Improvements
- Automation: Automatically recognizing the action performed and the person's orientation
- Confidence Scores: Using the confidence scores returned by PoseNet for more efficient scoring
- Point Specification: According to the action to be performed, specific points can be used to increase accuracy
- Feedback System: Providing feedback to users on which body part's movement needs correction