An introduction to deterministic builds with C/C++
What are deterministic builds?
A deterministic build is the process of building the same source code, with the same build environment and build instructions, so that it produces the same binary in any two builds, even if they are made on different machines, in different build directories and with different names. Such builds are also sometimes called reproducible or hermetic builds if they are guaranteed to produce the same binaries even when compiling from different folders.
Deterministic builds are not something that happens naturally. Normal projects do not produce deterministic builds, and the reasons why can differ for each operating system and compiler.
Deterministic builds should be guaranteed for a given build environment. That means that certain variables such as the operating system, build system versions and target architecture are assumed to remain the same between different builds.
In recent years, several organizations and projects, such as Chromium, Reproducible Builds and Yocto, have put a lot of effort into achieving deterministic builds.
The importance of deterministic builds
There are two main reasons why deterministic builds are important:
- Security. Modifying binaries instead of the upstream source code can make the changes invisible to the original authors. This can be fatal in safety-critical environments such as medical, aerospace and automotive. Promising identical results for given inputs allows third parties to come to a consensus on a correct result.
- Traceability and binary management. If you want to have a repository to store your binaries, you do not want to generate binaries with random checksums from sources at the same revision. That could lead the repository system to store different binaries as different versions when they should be the same. For example, if you are working on Windows or macOS, even the simplest library will produce binaries with different checksums because of the timestamps included in the library formats for these operating systems.
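The checksum comparison that a binary repository performs can be sketched in a few lines of Python. This is only an illustration: the helper names `file_digest` and `builds_match` are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large binaries do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(path_a, path_b):
    """True when two build outputs have the same digest."""
    return file_digest(path_a) == file_digest(path_b)
```

If two builds of the same revision make `builds_match` return False, the repository would treat them as different artifacts.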
Binaries involved in the building process in C/C++
There are different types of binaries that are created during the building process in C/C++ depending on the operating system.
- Microsoft Windows. The most important files are the ones with .obj, .lib, .dll and .exe extensions. All of them follow the specification of the Portable Executable format (PE). These files can be analyzed with tools such as dumpbin.
- Linux. Files with .o, .a, .so and no extension (for executable binaries) follow the Executable and Linkable Format (ELF). The contents of ELF files can be analyzed with readelf.
- macOS. Files with .o, .a, .dylib and no extension (for executable binaries) follow the Mach-O format specification. These files can be inspected with the otool application that is part of the Xcode toolchain on macOS.
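As a rough illustration of how these formats differ on disk, the sketch below guesses the container format from the leading magic bytes. The function name is hypothetical, and the check is deliberately incomplete: COFF .obj files have no signature and universal (fat) Mach-O files are not handled, so both are reported as unknown.

```python
# Mach-O magic numbers, 32-bit and 64-bit, in both byte orders.
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",
}


def identify_binary_format(data):
    """Guess the container format from the first bytes of a file."""
    if data.startswith(b"!<arch>\n"):
        return "archive"  # .a on Linux/macOS, .lib on Windows
    if data.startswith(b"\x7fELF"):
        return "ELF"
    if data.startswith(b"MZ"):
        return "PE"  # .exe and .dll images start with the DOS stub
    if data[:4] in MACHO_MAGICS:
        return "Mach-O"
    return "unknown"
```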
Sources of variation
Many different factors can make your builds non-deterministic. These factors vary between operating systems and compilers. Each compiler has specific options to fix the sources of indeterminism. To date, gcc and clang are the ones that incorporate the most options to fix the sources of variation. For msvc there are some undocumented options that you can try but, in the end, you will probably need to patch the binaries to get deterministic builds.
Timestamps introduced by the compiler/linker
There are two main reasons why our binaries could end up containing time information that makes them non-reproducible:
- The use of the __DATE__ or __TIME__ macros in the sources.
- The definition of the file format forces time information to be stored in the object files. This is the case for the Portable Executable format on Windows and Mach-O on macOS. On Linux, ELF files do not encode any kind of timestamp.
Let’s look at an example of where this information ends up, using a basic hello world project that links a static library on macOS.
The library prints a message in the terminal:
And the application will use it to print a “Hello World!” message:
We will use CMake to build the project:
We build two different libraries with the exact same sources and two binaries with the same sources as well. If we build the project and execute md5sum to show the checksums of all the binaries:
We get an output like this:
This is interesting because the executable files helloA and helloB have the same checksums, as do the intermediate Mach-O object files hello_world.cpp.o, but that is not the case for the .a files. That is because they store the information of the intermediate object files in archive format. The definition of the header of this format includes a field named st_time set by a stat system call. If we inspect libHelloLibA.a and libHelloLibB.a using otool to show the headers:
We can see that the file includes several time fields that will make our build non-deterministic. Note that those fields are not propagated to the final executables, as their identical checksums show. This problem would also happen when building on Windows with Visual Studio, but with the Portable Executable format instead of Mach-O.
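To see where these time fields sit, here is a minimal Python sketch that walks the member headers of an archive and reports the modification-time field of each one. It assumes the common ar layout (an 8-byte `!<arch>\n` signature followed by 60-byte member headers with the date in bytes 16 to 27); the function name is hypothetical, and BSD-style long names stored inside the member data are not decoded.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60


def list_ar_timestamps(data):
    """Return (name, mtime) for each member header in an ar archive."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    offset = len(AR_MAGIC)
    while offset + HEADER_SIZE <= len(data):
        header = data[offset:offset + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header at offset %d" % offset)
        name = header[0:16].decode("ascii").rstrip()
        date_text = header[16:28].decode("ascii").strip()
        mtime = int(date_text) if date_text else None
        size = int(header[48:58].decode("ascii"))
        members.append((name, mtime))
        # Member data is padded to an even number of bytes.
        offset += HEADER_SIZE + size + (size % 2)
    return members
```

Two archives built at different times from identical object files would differ only in the values returned here.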
At this point we could try to make things even worse and force our binaries to be non-deterministic as well. If we change the main.cpp file to include the __TIME__ macro:
Getting the checksums of the files again:
We see that now we have different binaries as well. We could analyze the executable file with a tool such as diffoscope that shows us the difference between the two binaries:
That shows that the __TIME__ information was inserted in the binary, making it non-deterministic. Let’s see what we could do to avoid this.
Possible solutions for Microsoft Visual Studio
Microsoft Visual Studio has a linker flag, /Brepro, that is undocumented by Microsoft. That flag sets the timestamps in the Portable Executable format to a -1 value, as can be seen in the image below.
To activate that flag with CMake we will have to add these lines if creating a .exe:
or these for a .lib:
The problem is that this flag makes the binaries reproducible (regarding timestamps in the file format) if our final binary is a .exe, but it will not remove all timestamps from a .lib (the same problem that we talked about with the Mach-O object files above). The TimeDateStamp field from the COFF File Header for the .lib files will stay. The only way to remove this information from the .lib binary is to patch the .lib, substituting the bytes corresponding to the TimeDateStamp field with any known value.
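The byte substitution can be sketched in Python for a single COFF object. This assumes the standard 20-byte COFF File Header, where TimeDateStamp is the 4-byte little-endian field at offset 4; the function names are hypothetical. A real .lib is an archive, so each member object would need this treatment, and the archive member headers carry their own date fields that this sketch does not touch.

```python
import struct

# TimeDateStamp follows Machine (2 bytes) and NumberOfSections (2 bytes).
COFF_TIMESTAMP_OFFSET = 4
COFF_HEADER_SIZE = 20


def read_coff_timestamp(obj_bytes):
    """Read the TimeDateStamp field of a COFF File Header."""
    return struct.unpack_from("<I", obj_bytes, COFF_TIMESTAMP_OFFSET)[0]


def patch_coff_timestamp(obj_bytes, new_value=0):
    """Return a copy of a COFF object with TimeDateStamp overwritten."""
    if len(obj_bytes) < COFF_HEADER_SIZE:
        raise ValueError("too short to contain a COFF file header")
    patched = bytearray(obj_bytes)
    struct.pack_into("<I", patched, COFF_TIMESTAMP_OFFSET, new_value)
    return bytes(patched)
```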
Possible solutions for GCC and CLANG
- gcc detects the existence of the SOURCE_DATE_EPOCH environment variable. If this variable is set, its value specifies a UNIX timestamp to be used in place of the current date and time in the __DATE__ and __TIME__ macros so that the embedded timestamps become reproducible. The value can be set to a known timestamp such as the last modification time of the source or package.
- clang makes use of the ZERO_AR_DATE environment variable which, if set, resets the timestamp that is introduced in the archive files, setting it to epoch 0. Take into account that this will not fix the __DATE__ or __TIME__ macros. If we want to fix the effect of these macros we should either patch the binaries or fake the system time.
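One way to apply both variables is to build the environment in a small wrapper before launching the build. The sketch below is a minimal illustration; the function name and the default timestamp are assumptions, not part of any toolchain.

```python
import os


def reproducible_env(base_env=None, source_date_epoch=0):
    """Return a copy of the environment with the reproducibility variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Used by gcc for the __DATE__ and __TIME__ macros.
    env["SOURCE_DATE_EPOCH"] = str(int(source_date_epoch))
    # Resets the timestamps written into archive files on macOS.
    env["ZERO_AR_DATE"] = "1"
    return env
```

The result could then be passed to the build, for example with `subprocess.run(["cmake", "--build", "."], env=reproducible_env(source_date_epoch=1564483496))`.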
Let’s continue with our example project for macOS and see what the results are when setting the ZERO_AR_DATE environment variable.
Now, if we build our executable and libraries (omitting the __DATE__ macro in the sources), we get:
All the checksums are now the same. And analyzing the headers of the .a files:
We can see that the timestamp field of the library header has been set to zero.
Build folder information propagated to binaries
If the same sources are compiled in different folders sometimes folder information is propagated to the binaries. This can happen mainly for two reasons:
- Use of macros that contain current file information, like the __FILE__ macro.
- Creating debug binaries that store information about where the sources are.
Continuing with our hello world macOS example, let’s separate the sources so we can show the effect on the final binaries. The project structure will be like the one below.
If we build our binaries in Debug mode.
We get the following checksums:
The folder information is propagated from the object files to the final executables making our builds non-reproducible. We could show the differences between binaries using diffoscope to see where the folder information is embedded.
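A quick way to check whether a build directory leaked into a binary, short of a full diffoscope run, is to search for the path bytes directly. This is a minimal sketch with a hypothetical function name; it only finds paths stored as plain UTF-8 text.

```python
def find_embedded_paths(binary, build_dir):
    """Return the byte offsets where build_dir appears inside the binary."""
    needle = build_dir.encode("utf-8")
    offsets = []
    start = binary.find(needle)
    while start != -1:
        offsets.append(start)
        start = binary.find(needle, start + 1)
    return offsets
```

An empty list for every build directory used is a necessary, though not sufficient, sign that folder information has been stripped.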
Possible solutions
Again the solutions will depend on the compiler used:
- msvc has no options to avoid the propagation of this information to the binary files. The only way to get reproducible binaries is, again, to use a patching tool to strip this information in the build step. Note that, as we are patching the binaries to make them reproducible, the folders used for different builds should have the same length in characters.
- gcc has three compiler flags to work around the issue:
  - -fdebug-prefix-map=OLD=NEW can strip directory prefixes from debug info.
  - -fmacro-prefix-map=OLD=NEW is available since gcc 8 and addresses irreproducibility due to the use of the __FILE__ macro.
  - -ffile-prefix-map=OLD=NEW is available since gcc 8 and is the union of -fdebug-prefix-map and -fmacro-prefix-map.
- clang supports -fdebug-prefix-map=OLD=NEW from version 3.8 and is working on supporting the other two flags for future versions.
The best way to solve this is by adding the flags to the compiler options. If we are using CMake:
target_compile_options(target PUBLIC "-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.")
File order feeding to the build system
File ordering can be a problem if directories are read to list their files. For example, Unix does not guarantee a deterministic order in which readdir() and listdir() return the contents of a directory, so relying on these functions to feed the build system could produce non-deterministic builds.
The same problem arises, for example, if your build system stores the files for the linker in a container that can return the elements in a non-deterministic order (such as a Python set, or a regular dictionary in Python versions before 3.7). This would cause the files to be linked in a different order each time, producing different binaries.
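The usual remedy is to sort the listing before handing it to the build system. A minimal sketch, with a hypothetical function name and an assumed set of source extensions:

```python
import os


def list_sources(directory, extensions=(".c", ".cpp")):
    """List source files in a stable order, independent of the file system."""
    names = [name for name in os.listdir(directory)
             if name.endswith(extensions)]
    # os.listdir() returns entries in arbitrary order, so sort explicitly.
    return sorted(names)
```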
We can simulate this problem by changing the order of files in CMake. If we modify the previous example to have more than just one source file for the library:
We can see that the results of the compilation are different if we change the order of files in the CMakeLists.txt:
If we make two consecutive builds named A and B, swapping sources0.cpp and sources1.cpp in the files list, the resulting checksums will be:
Object files (.o) are identical but .a libraries and executables are not. That is because the insertion order in the libraries depends on the order in which the files were listed.
Randomness created by the compiler
This problem arises, for example, in gcc when Link-Time Optimizations are activated (with the -flto flag). This option introduces randomly generated names in the binary files. The only way to avoid this problem is to use the -frandom-seed flag. This option provides a seed that gcc uses when it would otherwise use random numbers. It is used to generate certain symbol names that have to be different in every compiled file. It is also used to place unique stamps in coverage data files and the object files that produce them. This setting has to be different for each source file. One option would be to set it to the checksum of the file so the probability of collision is very low. For example, in CMake this could be done with a function that sets the flag for each source file.
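The per-file seed idea can also be sketched outside CMake. The Python helper below derives the flag from the contents of a source file; the function name is hypothetical and SHA-256 is an arbitrary choice, since any stable checksum would serve.

```python
import hashlib


def random_seed_flag(source_path):
    """Build a -frandom-seed flag from the checksum of a source file."""
    with open(source_path, "rb") as handle:
        checksum = hashlib.sha256(handle.read()).hexdigest()
    # Same contents give the same seed; different files almost surely differ.
    return "-frandom-seed=" + checksum
```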
Some tips using Conan
Conan hooks can help us in the process of making our builds reproducible. This feature makes it possible to customize the client behavior at specific points.
One use of hooks could be setting environment variables in the pre_build step. The example below calls a function set_environment and then restores the environment in the post_build step with reset_environment.
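A minimal sketch of such a hook follows. It assumes the Conan 1.x hook signature `pre_build(output, conanfile, **kwargs)`; the timestamp value is an arbitrary placeholder, and the module-level `_saved` dictionary is just one simple way to remember the previous values.

```python
import os

# Placeholder values; pick a timestamp that is meaningful for your project.
REPRO_VARS = {"SOURCE_DATE_EPOCH": "1564483496", "ZERO_AR_DATE": "1"}
_saved = {}


def set_environment():
    """Set the reproducibility variables, remembering the previous values."""
    for name, value in REPRO_VARS.items():
        _saved[name] = os.environ.get(name)
        os.environ[name] = value


def reset_environment():
    """Restore the values that were present before set_environment()."""
    for name, old_value in _saved.items():
        if old_value is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = old_value
    _saved.clear()


def pre_build(output, conanfile, **kwargs):
    set_environment()
    output.info("reproducible build environment set")


def post_build(output, conanfile, **kwargs):
    reset_environment()
    output.info("build environment restored")
```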
Hooks can also be useful to patch binaries in the post_build step. There are several tools for analyzing and patching binary files, such as ducible, pefile, pe-parse or strip-nondeterminism. An example of a hook for patching a PE binary using ducible could look like this:
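What follows is a hedged sketch rather than a tested production hook. It assumes the Conan 1.x hook signature and the `build_folder` and `run` members of the conanfile; the install location in `DUCIBLE_PATH` is hypothetical, and ducible is assumed to accept the image path as its first argument.

```python
import os

DUCIBLE_PATH = "C:/ducible/ducible.exe"  # hypothetical install location
PE_EXTENSIONS = (".exe", ".dll")


def find_pe_files(build_folder):
    """Collect .exe and .dll files under the build folder, in sorted order."""
    found = []
    for root, _dirs, filenames in os.walk(build_folder):
        for filename in filenames:
            if filename.lower().endswith(PE_EXTENSIONS):
                found.append(os.path.join(root, filename))
    return sorted(found)


def post_build(output, conanfile, **kwargs):
    if not os.path.isfile(DUCIBLE_PATH):
        output.warn("ducible not found, binaries left unpatched")
        return
    for path in find_pe_files(conanfile.build_folder):
        output.info("Patching {}".format(path))
        conanfile.run('"{}" "{}"'.format(DUCIBLE_PATH, path))
```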
Conclusions
Deterministic builds are a complex problem, highly coupled with the operating system and toolchain used. This introduction should help you understand the most common causes of indeterminism and how to avoid them.
References
General info
- https://www.chromium.org/developers/testing/isolated-testing/deterministic-builds
- https://reproducible-builds.org/
- https://wiki.yoctoproject.org/wiki/Reproducible_Builds
- https://stackoverflow.com/questions/1180852/deterministic-builds-under-windows
- https://docs.microsoft.com/en-us/windows/win32/debug/pe-format#archive-library-file-format
- https://devblogs.microsoft.com/oldnewthing/20180103-00/?p=97705
- https://www.geoffchappell.com/studies/msvc/link/link/options/brepro.htm?tx=37&ts=0,267
Tools
Tools for comparing binaries
- https://diffoscope.org/
- https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/fc
Tools for patching files
- https://salsa.debian.org/reproducible-builds/strip-nondeterminism
- https://github.com/erocarrera/pefile
- https://github.com/trailofbits/pe-parse
- https://github.com/smarttechnologies/peparser
- https://github.com/google/syzygy
- https://github.com/nh2/ar-timestamp-wiper