Real-Time Pose Detection in C++ using Machine Learning with TensorFlow Lite
In this post, we’re going to dive into using TensorFlow Lite in C++ for real-time human pose estimation with the help of a model downloaded from TensorFlow Hub, specifically trained for this purpose. We’ll put together an example that uses OpenCV to load a video and then processes it frame by frame to figure out the joint locations of a human body in each image. When we’re done, our application will draw the detected pose skeleton over each frame of the video.1
A short introduction
TensorFlow Lite is a library specially designed for deploying deep learning models on mobile, microcontrollers, and other edge devices. The main difference between TensorFlow and TensorFlow Lite is that TensorFlow is used for creating and training machine learning models, while TensorFlow Lite is a simpler version designed for running those models on devices like mobile phones.
TensorFlow Hub is a repository where we can find lots of trained machine learning models, ready to be used in our applications. For our example, we will use the MoveNet.SinglePose.Lightning model, but there are many other models compatible with TensorFlow Lite in the TensorFlow Hub. Some examples include:
- Image style transfer
- Image depth estimation from monocular images
- Image classification for different domains, such as crop disease detection, insect identification, or food classification
- Image super-resolution
Some essential concepts
Before diving into the example, let’s briefly explain some essential concepts:
- Model: A machine learning model is a mathematical representation of a real-world process, learned from data. It is used to make predictions or decisions without being explicitly programmed to perform the task.
- Inference: Inference is the process of using a trained machine learning model to make predictions or decisions based on new input data. It allows the model to apply its learned knowledge to new, unseen data.
- Tensor: A multi-dimensional array used to represent data in deep learning models. Tensors are the primary data structure used in TensorFlow to represent and manipulate data.
- Shape: The dimensions of a tensor, describing the number of elements in each dimension. For example, a matrix with 3 rows and 4 columns has a shape of (3, 4).
- Channel: In the context of image processing, channels refer to the separate color components of an image. For example, a typical color image has three channels: red, green, and blue (RGB).
Using TensorFlow Lite in your application
All the source code for this example is available in the Conan 2.0 examples repo. There, you will find the project, whose relevant files are the source code and CMakeLists.txt for our application, the video we are going to process, and the model for the neural network we will load into TensorFlow Lite. Our application runs the inference on the model (making predictions based on input data) to detect human key points, providing us with the positions of various body joints. The app is organized as a simple pipeline: load the model, read and transform the input frames, run the inference, and interpret the output.
Loading the neural network model
As we previously mentioned, we will use the MoveNet.SinglePose.Lightning model in our example. This model comes as a `.tflite` file, which contains the model’s execution graph. The first step is loading this `.tflite` file into memory. The model is stored in the `FlatBufferModel` class, and you can create an instance of it using the `BuildFromFile` method with the model file name as the input argument.

Next, we build an `Interpreter`, the class that will take the model and execute the operations it defines on input data while also providing access to the output. To do so, we use the `InterpreterBuilder`, which will allocate memory for the `Interpreter` and manage the setup so that the `Interpreter` can read the provided model. Note that before running the inference, we tell the interpreter to allocate memory for the model’s tensors by calling the `AllocateTensors()` method. We also call `PrintInterpreterState`, a debugging utility useful for inspecting the state of the interpreter nodes and tensors.
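A minimal sketch of this loading step could look like the following (the model filename is illustrative):

```cpp
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/optional_debug_tools.h"

// Load the .tflite file containing the model's execution graph
auto model = tflite::FlatBufferModel::BuildFromFile("movenet_singlepose_lightning.tflite");

// Build the Interpreter: the InterpreterBuilder allocates memory for it
// and sets it up to read the provided model
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);

// Allocate memory for the model's tensors before running the inference
interpreter->AllocateTensors();

// Debugging utility to inspect the state of nodes and tensors
tflite::PrintInterpreterState(interpreter.get());
```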
Read and transform the input data
Now, with our interpreter prepared to receive data and perform the inference, we must first adapt our data to match the input format accepted by the model. In this section, we’ll outline the following process:
- Read the input video (in our case, it has dimensions of 640x360 pixels).
- Crop the input video frame to create a square image (resulting in an image of 360x360 pixels).
- Resize the image to match the input accepted by the model (we’ll see that it’s 192x192) and copy it to the model’s input.
For this specific model, if we check the documentation, we can see that the input must be in the form of “an uint8 tensor of shape: 192x192x3. Channels order: RGB with values in [0, 255]”.
Although not necessary, we could access the input tensor from the interpreter to confirm the tensor input size, which in this case is `[1,192,192,3]`. The first element is the batch size, which is 1 as we are only using one image as the input of the model.
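For instance, a quick sketch of that check, assuming the `interpreter` built above:

```cpp
#include <iostream>

// Print the shape of the first input tensor; for this model
// it should print: 1 192 192 3 (batch, height, width, channels)
TfLiteTensor* input_tensor = interpreter->input_tensor(0);
for (int i = 0; i < input_tensor->dims->size; ++i)
    std::cout << input_tensor->dims->data[i] << " ";
std::cout << std::endl;
```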
We want to perform pose detection on a video with dimensions of 640x360 pixels, so we have to crop and resize the video frames to 192x192 pixels before inputting them into the model (we have omitted the frame capture code for simplicity, but you can find the code in the repository). To do so, we use the `resize()` function from the OpenCV library.
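A sketch of that crop-and-resize step could look like this (the helper name is ours; note that OpenCV stores frames in BGR order, so we also swap to the RGB order the model expects):

```cpp
#include <algorithm>
#include <opencv2/opencv.hpp>

// Hypothetical helper: crop the frame to a centered square and resize
// it to the 192x192 RGB image expected by the model
cv::Mat preprocess_frame(const cv::Mat& frame) {
    int side = std::min(frame.cols, frame.rows);         // 360 for a 640x360 frame
    int x = (frame.cols - side) / 2;
    int y = (frame.rows - side) / 2;
    cv::Mat square = frame(cv::Rect(x, y, side, side));  // centered 360x360 crop
    cv::Mat resized;
    cv::resize(square, resized, cv::Size(192, 192));     // 192x192 model input
    cv::cvtColor(resized, resized, cv::COLOR_BGR2RGB);   // OpenCV frames are BGR by default
    return resized;
}
```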
The final step is to copy the data from the resized video frame to the input of the interpreter. We can get a pointer to the input tensor by calling `typed_input_tensor` from the interpreter.
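Assuming the 192x192, 3-channel uint8 image in `resized` from the sketch above, the copy could be:

```cpp
#include <cstdint>
#include <cstring>

// Get a pointer to the model's uint8 input tensor and copy the frame into it
uint8_t* input = interpreter->typed_input_tensor<uint8_t>(0);
std::memcpy(input, resized.data, resized.total() * resized.elemSize());
```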
To summarize the whole size conversion pipeline for the video frames: each 640x360 input frame is cropped to a 360x360 square, which is then resized to the 192x192 input of the model.
Running inference
After preparing and copying the input data to the input tensor, we can finally run the inference. This can be done by calling the `Invoke()` method of the interpreter. If the inference runs successfully, we can recover the output tensor from the model by getting `typed_output_tensor` from the interpreter.
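A minimal sketch, again reusing the `interpreter` from before:

```cpp
// Run the inference on the data we copied into the input tensor
if (interpreter->Invoke() != kTfLiteOk) {
    std::cerr << "Failed to run inference" << std::endl;
    // handle the error as appropriate
}

// Pointer to the float32 output tensor holding the keypoint data
float* output = interpreter->typed_output_tensor<float>(0);
```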
Interpreting output
Each model outputs the tensor data from the inference in a certain format that we have to interpret. In this case, the documentation for the model states that the output is a float32 tensor of shape [1, 1, 17, 3], storing this information:
- The first two channels of the last dimension represent the yx coordinates (normalized to image frame, i.e., range in [0.0, 1.0]) of the 17 key points (in the order of: [nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle]).
- The third channel of the last dimension represents the prediction confidence scores of each keypoint, also in the range [0.0, 1.0].
We created a `draw_keypoints()` helper function that takes the output tensor and organizes the different output coordinates to draw the pose skeleton over the video frame. We also take the confidence of the output into account, filtering out those results with a confidence below the 0.2 threshold.
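The real helper also draws the skeleton lines between joints, but a simplified sketch of the idea (names and drawing parameters are ours) could be:

```cpp
// Draw each of the 17 detected keypoints whose confidence is above
// the threshold; output points at the [1, 1, 17, 3] tensor data
void draw_keypoints(cv::Mat& frame, const float* output) {
    const float confidence_threshold = 0.2f;
    for (int i = 0; i < 17; ++i) {
        float y          = output[i * 3];      // normalized to [0.0, 1.0]
        float x          = output[i * 3 + 1];  // normalized to [0.0, 1.0]
        float confidence = output[i * 3 + 2];
        if (confidence >= confidence_threshold) {
            // Scale the normalized coordinates back to frame pixels
            cv::Point center(static_cast<int>(x * frame.cols),
                             static_cast<int>(y * frame.rows));
            cv::circle(frame, center, 3, cv::Scalar(0, 255, 0), -1);
        }
    }
}
```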
Installing TensorFlow Lite and OpenCV dependencies and building the project
Consuming the TensorFlow Lite and OpenCV libraries using Conan is quite straightforward. If you have a look at the CMakeLists.txt of the project, you will see there is nothing Conan-specific in it.
To make Conan install the libraries and generate the files needed to build the project with CMake, we simply need to create a conanfile.py that declares the dependencies for the project.
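A minimal conanfile.py along these lines should be enough (the class name is illustrative; the versions are the ones this example uses):

```python
from conan import ConanFile
from conan.tools.cmake import cmake_layout


class PoseEstimationRecipe(ConanFile):
    settings = "os", "compiler", "build_type", "arch"
    generators = "CMakeToolchain", "CMakeDeps"

    def requirements(self):
        # Libraries the application depends on
        self.requires("tensorflow-lite/2.10.0")
        self.requires("opencv/4.5.5")

    def layout(self):
        # Standard CMake project layout
        cmake_layout(self)
```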
As you can see, we just declare the dependencies in the `requirements()` method of the ConanFile. We are also declaring the `layout()` for the project as `cmake_layout`, as we are using CMake for building. You can check the consuming packages tutorial section of the Conan documentation for more information.
Now we can use Conan to install the libraries. It will not only install tensorflow-lite/2.10.0 and opencv/4.5.5, but also all the necessary transitive dependencies; the exact dependency graph will depend on your platform and configuration.
Conan will attempt to install those packages from the default ConanCenter remote, which is the main official repository for open-source Conan packages. If the pre-compiled binaries are not available for your configuration, you can also build from sources.
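Run from the folder containing the conanfile.py, the command would look something like this (adapt the profile and settings to your setup):

```shell
conan install . --build=missing -s compiler.cppstd=17
```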
A couple of things to take into account:
- We are passing a value for the C++ standard, as the tensorflow-lite library requires C++17 or higher.
- We are passing the `--build=missing` argument in case some binaries are not available from the remote.
- If you are running Linux and some necessary system libraries are missing on your system, you may have to add the `-c tools.system.package_manager:mode=install` or `-c tools.system.package_manager:sudo=True` arguments to the command line (docs reference).
Now let’s build the project and run the application. If you have CMake>=3.23 installed, you can use CMake presets:
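Something along these lines (the preset names assume the Release configuration generated by Conan):

```shell
cmake --preset conan-release
cmake --build --preset conan-release
```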
Otherwise, you can add the necessary arguments for CMake:
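For example, pointing CMake at the toolchain file that Conan generated (the exact path depends on your layout and build type):

```shell
cmake . -DCMAKE_TOOLCHAIN_FILE=build/Release/generators/conan_toolchain.cmake -DCMAKE_BUILD_TYPE=Release
cmake --build .
```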
Conclusions
Now that you’re familiar with the basics of using TensorFlow Lite in your applications, you can explore other models. Additionally, having experienced the ease of installing and using libraries like TensorFlow Lite and OpenCV, you’re now well-equipped to create more complex applications incorporating additional libraries. To search for all libraries available in ConanCenter, you can use the `conan search '*' -r=conancenter` command.
1. Video by Olia Danilevich from https://www.pexels.com/