How does PoseNet work?

Even with camera-based input, the process of performing pose detection does not change. We start with an input image (a single frame of a video is enough for this). The image is first passed through a CNN, which identifies where the people are in the scene. The output from the CNN is then passed through a pose decoding algorithm (we'll come back to this in a moment), which decodes the poses.
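As a minimal sketch, the two-step flow looks like this, assuming the TensorFlow.js PoseNet package (@tensorflow-models/posenet) and its v2-style API; `detectPose` is a hypothetical helper name, and the decoding step is discussed next:

```ts
import * as posenet from '@tensorflow-models/posenet';

async function detectPose(image: HTMLImageElement) {
  // Step 1: load the model; the CNN locates people in the scene.
  const net = await posenet.load();

  // Step 2: run the image through the CNN and decode a pose from its output.
  return net.estimateSinglePose(image, { flipHorizontal: false });
}
```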

We said pose decoding algorithm to gloss over the fact that there are actually two decoding algorithms: we can detect a single pose, or, if there are multiple people in the scene, we can detect multiple poses.
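Both decoders are exposed on the loaded model. The following is a hedged sketch, not a prescribed setup; the configuration values for the multi-pose call are illustrative defaults:

```ts
import * as posenet from '@tensorflow-models/posenet';

async function decodePoses(input: HTMLImageElement) {
  const net = await posenet.load();

  // Single-pose decoding: assumes one person in the frame.
  const single = await net.estimateSinglePose(input, { flipHorizontal: false });

  // Multi-pose decoding: handles several people at a small speed cost.
  const multiple = await net.estimateMultiplePoses(input, {
    flipHorizontal: false,
    maxDetections: 5,    // upper bound on how many people to detect
    scoreThreshold: 0.5, // drop detections below this confidence
    nmsRadius: 20,       // non-maximum suppression radius, in pixels
  });

  return { single, multiple };
}
```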

We have opted for the single pose algorithm because it is the simpler and faster of the two. The trade-off is that, if there are multiple people in the picture, the algorithm may merge key points from different people; occlusion, for example, could cause it to detect person 2's right shoulder as person 1's left elbow. In the following image, we can see how the elbow of the girl on the right obscures the left elbow of the person in the middle:

Occlusion is when one part of an image hides another part.

The key points that are detected by PoseNet are as follows (a short sketch after the list shows how they are surfaced in code):

  • Nose
  • Left eye
  • Right eye
  • Left ear
  • Right ear
  • Left shoulder
  • Right shoulder
  • Left elbow
  • Right elbow
  • Left wrist
  • Right wrist
  • Left hip
  • Right hip
  • Left knee
  • Right knee
  • Left ankle
  • Right ankle
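Each of these comes back as an entry in the pose's keypoints array, with a part name, a confidence score, and an (x, y) position in the input image's coordinate space. A minimal sketch, assuming the same package as above:

```ts
import * as posenet from '@tensorflow-models/posenet';

async function logKeypoints(image: HTMLImageElement) {
  const net = await posenet.load();
  const pose = await net.estimateSinglePose(image, { flipHorizontal: false });

  // Part names use camel case, for example 'leftShoulder' or 'rightAnkle'.
  for (const { part, score, position } of pose.keypoints) {
    console.log(
      `${part}: (${position.x.toFixed(1)}, ${position.y.toFixed(1)}), ` +
      `score ${score.toFixed(2)}`
    );
  }
}
```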

We can see where these are placed in our application. When detection has finished, the key points are drawn as an overlay on the image.
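A minimal sketch of drawing such an overlay, assuming a canvas element positioned over the source image; `pose` is a result from estimateSinglePose, and the 0.5 score threshold is an illustrative choice:

```ts
import type { Pose } from '@tensorflow-models/posenet';

function drawOverlay(pose: Pose, canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext('2d');
  if (!ctx) return;

  for (const { position, score } of pose.keypoints) {
    if (score < 0.5) continue; // skip low-confidence points

    // Draw a small circle at each detected key point.
    ctx.beginPath();
    ctx.arc(position.x, position.y, 4, 0, 2 * Math.PI);
    ctx.fillStyle = 'aqua';
    ctx.fill();
  }
}
```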
