[OE-core] The state of reproducible Builds

Tue Jul 2 14:13:01 UTC 2019

On 7/2/19 8:26 AM, Adrian Bunk wrote:
> On Mon, Jul 01, 2019 at 10:58:04AM -0500, Joshua Watt wrote:
>> ...
>> 1. HOSTTOOLS differences. There are a lot of tools listed in HOSTTOOLS, and
>> unfortunately some of them have version dependent output and are used for
>> target builds (the one I've currently stumbled upon is pod2man, but I'm sure
>> there are others). Unfortunately, one could probably argue that HOSTTOOLS is
>> somewhat antithetical to the above statement, at least in regard to target
>> builds. Any host tool output that "leaks" into the target build output can
>> result in a non-reproducible build across hosts, and possibly should be
>> avoided; the alternative is to use (or mandate) the corresponding -native
>> recipe that provides that tool as a DEPENDS so that the controlled
>> internally built version is used instead. Note that this only really applies
>> to target builds, not -native (or nativesdk right now). -native recipes would
>> obviously need more HOSTTOOLS to help bootstrap the system. I suspect this
>> would require reworking how HOSTTOOLS works so that they can be split into
>> two categories somehow; the tools that have "ubiquitous and stable"
>> interfaces and are fine for all recipes (e.g. cat, sed, true, rm, etc.) and
>> those that are variable and should only be used for -native builds (e.g.
>> pod2man, rpcgen(?), chrpath(?), tar(?)... others?). Anyone have thoughts on
>> this?
>> ...
> What is the goal?
>
> 1. being able to prove that a given binary has actually been
>     built from the correct sources, or
> 2. builds on all hosts have the same output
I'm not sure there is just one goal...
> With 1. you can just record all host properties like installed packages
> and running kernel, and it isn't a problem if different hosts result in
> different output.

Right... I know that my employer would really like this sort of binary
reproducibility; that is, we should be able to pull some archived code
out of our salt mine, build it, and know it's the same binary that our
customers have. I think if you combine what we have today and some sort
of reproducible host image (archived Docker container, virtual machine,
et al.) we are pretty close to that.
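
To make the "record all host properties" idea a bit more concrete, here is a
rough Python sketch of what capturing a host snapshot next to a build could
look like. The function name, the fields, and the tool list are my own
assumptions for illustration; nothing like this exists in OE-core as far as
I know.

```python
import json
import platform
import shutil


def record_host_properties(tools=("pod2man", "tar", "sed")):
    """Collect a few host facts that could influence build output.

    Hypothetical helper: the chosen fields and tool names are assumptions.
    """
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Path of each host tool, or None if it is not installed
        "tools": {name: shutil.which(name) for name in tools},
    }


if __name__ == "__main__":
    # Archive this next to the build so the host can be recreated later
    print(json.dumps(record_host_properties(), indent=2, sort_keys=True))
```

An archived container or VM image captures far more than this, of course;
the snapshot would only help explain why two hosts produced different output.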

>
> With 2. any kind of differences due to host differences is a problem.
> You need -native for nearly everything, and then fix all other kinds of
> differences like the version of the running kernel recorded somewhere.

Yes. I would hope that after using mostly -native tools where 
applicable, the currently running kernel wouldn't figure into the build 
of target packages... if it does I would venture to say that is a 
cross-compiling/reproducibility bug in the package.

Also, to be clear, I'm hoping we don't need to go so far as to say that
-native recipes themselves necessarily need to be reproducible; as long
as the tools they provide always generate the same output regardless of
which host they were built on, I suspect they don't need to be.

>
> For detecting malicious binaries not built from the claimed sources 1. is
> sufficient. For distributions like Debian that build natively this is
> even the only option available since the host compiler is used.
>
> Doing 2. would of course be more desirable, but it can also be done in
> a second step after all issues related to building on exactly the same
> host have been sorted out.

I think there are also other use cases for #2 besides detecting
malicious binaries/source code, such as hash equivalence, or even being
able to use sstate when making a reproducible build. You are correct that
this can be done in a second step, but I think that everyone needs to be
aware of the limitations that remain until #2 is in place (the main one
being that you probably can't make a reproducible build if you use
sstate).
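
For what it's worth, checking #2 comes down to building the same thing on two
hosts and comparing the outputs byte for byte. A minimal Python sketch of that
comparison, assuming each build's output has been unpacked into a directory
(the function names are hypothetical, not an existing OE tool):

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(build_a, build_b):
    """List files that are missing from one tree or differ in content."""
    a = tree_digests(build_a)
    b = tree_digests(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from differing_files() means the two trees match; anything it
reports is a candidate for host leakage (a pod2man version string, for
example). Note that this only compares file contents, not permissions or
timestamps.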

>
>> Joshua Watt
>> ...
> cu
> Adrian
>