How does PoseNet work?

Even with camera-based input, the process of performing pose detection does not change. We start with an input image (a single frame of a video is enough for this). The image is first passed through a CNN, which identifies where the people are in the scene. The output from the CNN is then passed through a pose decoding algorithm (we'll come back to this in a moment), which decodes the poses.
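As a minimal sketch, the two-step flow looks like this, assuming the TensorFlow.js PoseNet package (@tensorflow-models/posenet) and its v2-style API; `detectPose` is a hypothetical helper name, and the decoding step is discussed next:

```ts
import * as posenet from '@tensorflow-models/posenet';

async function detectPose(image: HTMLImageElement) {
  // Step 1: load the model; the CNN locates people in the scene.
  const net = await posenet.load();

  // Step 2: run the image through the CNN and decode a pose from its output.
  return net.estimateSinglePose(image, { flipHorizontal: false });
}
```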

We said pose decoding algorithm to gloss over the fact that there are actually two decoding algorithms: we can detect a single pose, or, if there are multiple people in the scene, we can detect multiple poses.
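Both decoders are exposed on the loaded model. The following is a hedged sketch, not a prescribed setup; the configuration values for the multi-pose call are illustrative defaults:

```ts
import * as posenet from '@tensorflow-models/posenet';

async function decodePoses(input: HTMLImageElement) {
  const net = await posenet.load();

  // Single-pose decoding: assumes one person in the frame.
  const single = await net.estimateSinglePose(input, { flipHorizontal: false });

  // Multi-pose decoding: handles several people at a small speed cost.
  const multiple = await net.estimateMultiplePoses(input, {
    flipHorizontal: false,
    maxDetections: 5,    // upper bound on how many people to detect
    scoreThreshold: 0.5, // drop detections below this confidence
    nmsRadius: 20,       // non-maximum suppression radius, in pixels
  });

  return { single, multiple };
}
```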

We have opted for the single pose algorithm because it is the simpler and faster of the two. The trade-off is that, if there are multiple people in the picture, the algorithm may merge key points from different people; occlusion, for example, could cause it to detect person 2's right shoulder as person 1's left elbow. In the following image, we can see how the elbow of the girl on the right obscures the left elbow of the person in the middle:

Occlusion is when one part of an image hides another part.

The key points that are detected by PoseNet are as follows (a short sketch after the list shows how they are surfaced in code):

  • Nose
  • Left eye
  • Right eye
  • Left ear
  • Right ear
  • Left shoulder
  • Right shoulder
  • Left elbow
  • Right elbow
  • Left wrist
  • Right wrist
  • Left hip
  • Right hip
  • Left knee
  • Right knee
  • Left ankle
  • Right ankle
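Each of these comes back as an entry in the pose's keypoints array, with a part name, a confidence score, and an (x, y) position in the input image's coordinate space. A minimal sketch, assuming the same package as above:

```ts
import * as posenet from '@tensorflow-models/posenet';

async function logKeypoints(image: HTMLImageElement) {
  const net = await posenet.load();
  const pose = await net.estimateSinglePose(image, { flipHorizontal: false });

  // Part names use camel case, for example 'leftShoulder' or 'rightAnkle'.
  for (const { part, score, position } of pose.keypoints) {
    console.log(
      `${part}: (${position.x.toFixed(1)}, ${position.y.toFixed(1)}), ` +
      `score ${score.toFixed(2)}`
    );
  }
}
```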

We can see where these are placed in our application. When detection has finished, the key points are drawn as an overlay on the image.
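A minimal sketch of drawing such an overlay, assuming a canvas element positioned over the source image; `pose` is a result from estimateSinglePose, and the 0.5 score threshold is an illustrative choice:

```ts
import type { Pose } from '@tensorflow-models/posenet';

function drawOverlay(pose: Pose, canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext('2d');
  if (!ctx) return;

  for (const { position, score } of pose.keypoints) {
    if (score < 0.5) continue; // skip low-confidence points

    // Draw a small circle at each detected key point.
    ctx.beginPath();
    ctx.arc(position.x, position.y, 4, 0, 2 * Math.PI);
    ctx.fillStyle = 'aqua';
    ctx.fill();
  }
}
```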
