Serializing your data with Protobuf
In this blog post, we will discuss Protobuf (Protocol Buffers), a project that goes beyond a simple serialization library. The entire example presented here is available on GitHub.
⚠ (May 24, 2023) This blog post has been updated to work with Conan 2.x, and the protobuf syntax has been updated to version 3. The older post, using Conan 1.x and protobuf syntax version 2, is archived here.
You have probably already had to develop a project where you needed to exchange information between processes, or even across machines with different processor architectures. One well-known technique in this scenario is serialization, which consists of translating data structures or object state into a format that can be stored and later retrieved by both sides.
What is Protobuf?
Protocol Buffers is a popular open-source project developed by Google under the BSD 3-Clause license, providing a language-neutral, platform-neutral, and extensible mechanism for serializing structured data. It officially supports many popular languages, such as C++, C#, Dart, Go, Java, and Python, and unofficial third-party add-ons support further languages, such as C. You can find the source code on GitHub, where its popularity reaches almost 32K stars!
The neutral language used by Protobuf allows you to model messages in a structured format through .proto files:
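A minimal sketch of such a message, assuming a simple person record (field names and numbering are illustrative, not taken from the original example):

```proto
syntax = "proto3";

// Illustrative message modeling a person's information.
message Person {
  string name = 1;           // person's name
  int32 age = 2;             // person's age
  optional string email = 3; // contact e-mail (explicit "optional" requires protoc >= 3.15)
}
```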
In the example above we use a structure that represents a person's information, with attributes such as name and age, plus an optional email field. Note that in proto3 syntax all fields are effectively optional and get default values when unset; the older proto2 syntax allowed fields to be marked as required, and failing to fill a required field caused a runtime error when the message was built.
But why not XML?
But, why another language and serialization mechanism if we can use something already available like XML? The answer is performance.
Protobuf offers several advantages over XML for serialization. It allows you to write a simpler description of your data than XML does: even for small payloads, once multiple nested messages are involved, XML quickly becomes hard on human eyes.
Another advantage is size: because the Protobuf wire format is compact, serialized files can be up to 10 times smaller than their XML equivalents. But the great benefit is speed, which can be up to 100 times faster than standard XML serialization, thanks to its optimized mechanism. In addition to size and speed, Protobuf provides a compiler capable of processing a .proto file to generate code for any of the supported languages, unlike the traditional approach where the same structure must be maintained by hand in multiple source files.
That sounds good, but how do I use it in real life?
To illustrate the use of Protocol Buffers, we will exchange messages across different architectures and different languages. We will compile C++ code for the armv8 architecture, serialize an object to a file, and retrieve it with a Python script. This is a useful model for anyone who needs to exchange messages between differing architectures through IPC techniques, including on embedded systems.
For our example, we will use a message that carries the readings of several sensors. The file sensor.proto, which describes the message, is shown below:
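A sketch of what sensor.proto could look like, assuming readings for temperature, humidity, and a door switch (field names and numbers are illustrative):

```proto
syntax = "proto3";

package sensor;

// Illustrative message holding the readings of several sensors.
message Sensor {
  string name = 1;        // sensor identifier
  double temperature = 2; // degrees Celsius
  int32 humidity = 3;     // relative humidity (%)

  // State of a door switch.
  enum SwitchLevel {
    CLOSED = 0; // proto3 requires the first enum value to be zero
    OPEN = 1;
  }
  SwitchLevel door = 4;
}
```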
The syntax statement declares the version of the Protobuf language used, which can be proto2 or proto3. Versions 2 and 3 have important differences, but we will only address version 3 in this post; for more information about version 2, see the official documentation. In addition to the attributes already highlighted, there is the enumerator SwitchLevel, which represents the state of a door. We could also include nested messages, or even repeated fields for multiple doors, for example. For a complete description of the proto3 syntax, see the language guide.
The Protobuf serialization mechanism is driven by the protoc compiler, which parses the .proto file and generates source files for the language selected by its arguments, in this case C++. You can find more information in the compiler invocation section of the documentation.
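As a sketch, a manual invocation generating the C++ sources from sensor.proto could look like:

```shell
# Generate sensor.pb.h and sensor.pb.cc in the current directory
protoc --cpp_out=. sensor.proto
```

Later we will let CMake run this command for us instead of invoking it by hand.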
The protoc compiler will generate the files sensor.pb.h and sensor.pb.cc, which contain the getters and setters needed to access the attributes, as well as methods for serializing and parsing. These files work only as stubs, and it is still necessary to include the headers distributed by Protobuf. Without this compiler, we would have to hand-code every step of object serialization, and for any change to the message we would need to update both the C++ and Python files.
Now that we have the stubs, we can implement an example to serialize the data collected by a sensor.
The file main.cpp is described below:
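A sketch of what main.cpp could contain, assuming the Sensor message defined earlier (names and values are illustrative):

```cpp
#include <iostream>
#include <string>

#include "sensor.pb.h" // generated by protoc from sensor.proto

int main() {
    // Verify that the headers match the linked Protobuf library version.
    GOOGLE_PROTOBUF_VERIFY_VERSION;

    sensor::Sensor sensor;
    sensor.set_name("Laboratory");
    sensor.set_temperature(23.4);
    sensor.set_humidity(68);
    sensor.set_door(sensor::Sensor::OPEN);

    // Serialize the object to an in-memory string (binary wire format).
    const std::string data = sensor.SerializeAsString();
    std::cout << "Serialized " << data.size() << " bytes." << std::endl;

    // Release resources allocated by the Protobuf library.
    google::protobuf::ShutdownProtobufLibrary();
    return 0;
}
```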
The Sensor object can be serialized through methods inherited from the Message class. For example, we can serialize it to a string with the SerializeAsString method.
Note that the reconstruction can be performed by any other language supported by Protobuf, and on other architectures. For the transmission to occur between different processes, an IPC mechanism is needed; for this, Google provides the gRPC project, a universal RPC framework with direct Protobuf support. However, since our intention in this post is just to talk about Protobuf, we will simply use a file as the means to exchange messages between processes:
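Writing the serialized message to a file can be sketched as follows (the file name sensor.data is an assumption):

```cpp
#include <fstream>
#include <iostream>

#include "sensor.pb.h"

// ... after filling the sensor object as before:
std::ofstream output("sensor.data", std::ios::binary);
if (!sensor.SerializeToOstream(&output)) {
    std::cerr << "Failed to write sensor data." << std::endl;
}
```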
To perform serialization through a file, we use the SerializeToOstream method.
Building the project
For the next step, we will describe the actions for constructing the project by CMake:
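A sketch of a CMakeLists.txt along the lines the text describes (project and target names are assumptions):

```cmake
cmake_minimum_required(VERSION 3.15)
project(sensor_serialization CXX)

# Locate the Protobuf modules, libraries and macros (provided via Conan)
find_package(Protobuf REQUIRED)

# Run protoc over sensor.proto, filling PROTO_SRCS / PROTO_HDRS
protobuf_generate_cpp(PROTO_SRCS PROTO_HDRS sensor.proto)

add_executable(sensor-cpp main.cpp ${PROTO_SRCS} ${PROTO_HDRS})
target_link_libraries(sensor-cpp protobuf::protobuf)

# Generated files land in the build directory, so main.cpp can find sensor.pb.h
target_include_directories(sensor-cpp PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
```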
This recipe searches for the modules, libraries, and macros provided by the Protobuf project when calling find_package. Once the package is found and loaded correctly, the protobuf_generate_cpp function becomes available. It is responsible for executing protoc and populating the PROTO_SRCS and PROTO_HDRS variables with the generated files; without it, you would need to add the protoc command and its required arguments manually. The subsequent lines follow the usual pattern of CMake projects. Because the generated files are placed in the build directory, you need to add it with target_include_directories so that main.cpp can resolve sensor.pb.h.
It is also possible to observe that we are using Conan to resolve Protobuf as a dependency. The CMakeDeps generator will be in charge of generating the FindProtobuf.cmake file, which contains all the necessary variables and also provides the protobuf::protobuf target.
In addition, you must also declare the conanfile.txt file with the following dependencies:
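A conanfile.txt matching that description could look like this (the Protobuf version shown is an assumption; pick the one you need):

```ini
[requires]
protobuf/3.21.12

[tool_requires]
protobuf/3.21.12

[generators]
CMakeDeps
CMakeToolchain

[layout]
cmake_layout
```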
Since Protobuf is divided into two parts, the protoc executable and the libraries, we add the same package both as requires and as tool_requires. This makes it possible to install protoc for the architecture of the machine doing the build, as a tool requirement, and the libraries for the target architecture (aarch64) as a regular requirement.
As we are using CMake for this project, we need to declare the CMake generators CMakeDeps and CMakeToolchain. The CMakeDeps generator is responsible for generating the FindProtobuf.cmake file, and the CMakeToolchain generator for generating the conan_toolchain.cmake file, which CMake uses to configure the project. We also declare the cmake_layout layout, which organizes the files in the build directory.
Now just run the commands to build the project, assuming you are using Linux or macOS:
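A plausible sequence using Conan 2 with CMake presets (preset names may vary with your CMake version and configuration):

```shell
conan install . --build=missing
cmake --preset conan-release
cmake --build --preset conan-release
```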
So far so good, but what about cross-compilation? In that case, it is necessary to inform Conan of the compiler and the target platform:
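A sketch of the cross-building commands, assuming an armv8 host profile and a cross-compiler are already set up (the profile name and compiler path are assumptions):

```shell
conan install . --profile:build=default --profile:host=armv8 --build=missing
cmake --preset conan-release -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++
cmake --build --preset conan-release
```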
In the above commands, Conan installed Protobuf both for the host architecture and for the build architecture. Both build and host profiles are needed to perform cross-compilation: the build profile is used to install Protobuf for amd64, the architecture of the machine building the project, while the host profile is used to install Protobuf for armv8, the target architecture. The CMAKE_CXX_COMPILER variable tells CMake which compiler to use to compile the project, in this case the cross-compiler for armv8.
Parsing with Python
Now we get to the second step: reading the file back and rebuilding the object with Python. For this, we only need to update the CMake script so that it generates the Python stub in addition to the C++ files:
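The addition could be sketched as follows (the target name follows the text; the variable name is illustrative):

```cmake
# Generate the Python stub (sensor_pb2.py) alongside the C++ sources
protobuf_generate_python(PROTO_PY sensor.proto)

# Virtual target forcing CMake to actually invoke the Python generator
add_custom_target(proto_python ALL DEPENDS ${PROTO_PY})
```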
The protobuf_generate_python function has the same goal as protobuf_generate_cpp, but generates the file sensor_pb2.py. The proto_python virtual target was added to force CMake to call the generator for Python.
The next step is to develop the script that will read the file with the serialized data and parse it through the script generated in the previous step:
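A sketch of such a script, assuming the serialized file is named sensor.data:

```python
#!/usr/bin/env python3
import sensor_pb2  # stub generated by protoc

sensor = sensor_pb2.Sensor()

# Read the binary wire format produced by the C++ program
with open("sensor.data", "rb") as f:
    sensor.ParseFromString(f.read())

# Print the reconstructed message in human-readable text format
print(sensor)
```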
The script is fairly straightforward, just like the C++ code, and can be copied together with the sensor_pb2.py file directly to the target platform.
Conclusion
Transferring data between processes, serializing objects, or even storing data are techniques used in all kinds of scenarios, but they require a lot of effort to implement and are often not the goal of the project under development. Serialization can instead be solved through several available projects, such as Protobuf, without having to delve into the low-level details of processing all the data.
The success of Protobuf lies not only in serializing the data, but in the mechanism as a whole: from the neutral description language, flexible and easy to understand, to the compiler with support for multiple languages, and even the integration with other products, such as gRPC, which provides direct communication between processes without much effort.
This blog post was a tutorial demonstrating how tasks that could take hours to complete, including library development, can be solved in a few steps, using only what is already available and without the need to build from sources.
Interested in knowing more or commenting on the subject? Please do not hesitate to open a new issue.