I've seen a lot of confusion lately about reproducible builds and docker, probably related to the recent SolarWinds supply-chain attack, so I thought I'd try to clear up some common misconceptions. A thread!
A reproducible build is one that anyone can repeat and get exactly the same result. This means if you build a binary twice, the two outputs should be byte-for-byte identical. You generally check this by comparing hashes (sha256, for example) or with the diff command.
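Here's a rough sketch of that hash check in Python. The helper names (sha256_of_file, builds_match) are made up for illustration and the paths are whatever your two build outputs are.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large binaries don't have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(path_a, path_b):
    """True when the two build outputs hash to the same value."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

If builds_match returns False for two builds of the same source, something non-deterministic leaked into the output.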
At a high level, this means your build process is a pure function of the inputs. If the source code input to your build is the same, the binary output from your build should be the same.
In practice, tools tend to leave timestamps or machine-dependent info like hostnames everywhere. Making a build reproducible is an exercise in hunting these down and passing the right flags to your tools to strip them out.
This is where docker comes in!
Docker makes it easy to set up a reproducible environment. A container always has the same tool versions and FS layout. This eliminates a lot of non-reproducible cruft, but not all. I'm looking at you, timestamps! (gzip -n is your friend here: lowercase -n leaves the original name and timestamp out of the gzip header, while uppercase -N keeps them.)
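To see how a timestamp sneaks into an artifact, here's a small sketch using Python's gzip module (the mtime argument to gzip.compress needs Python 3.8+). The payload and the two timestamps are arbitrary values picked for the demo.

```python
import gzip
import hashlib

payload = b"same bytes every time\n"


def digest(data):
    return hashlib.sha256(data).hexdigest()


# Same input, but the gzip header records a different modification time.
first = gzip.compress(payload, mtime=1_600_000_000)
second = gzip.compress(payload, mtime=1_600_000_060)

# Pinning the header timestamp to a constant makes the output repeatable.
pinned_a = gzip.compress(payload, mtime=0)
pinned_b = gzip.compress(payload, mtime=0)

timestamps_leak = digest(first) != digest(second)
pinned_is_stable = digest(pinned_a) == digest(pinned_b)
```

Both archives decompress to the same payload. Only the header differs, and that's enough to change the hash.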
But what about "docker build" itself? A docker build is just a series of containers run in order. If each container is reproducible, the overall build should be too, right? Nope. Timestamps again! These appear in two places, the files and the image itself:
1. Each file in a layer has its own set of timestamps: mtime/atime/ctime. These are not part of the file itself, but do become part of the tarball that becomes the image layer, so they affect the hash of it. You can see these with the "stat" command inside a container.
2. The overall image itself has a timestamp. This reflects when it was created, and it is also included in the final hash. You can see this with the "docker images" command or in your registry UI.
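The file-timestamp half of this is easy to demo without docker. A minimal sketch, using an in-memory tarball as a stand-in for an image layer; make_layer, layer_digest, the file name and the two mtimes are all hypothetical and this is not how docker assembles layers internally.

```python
import hashlib
import io
import tarfile


def make_layer(content, mtime):
    """Build a one-file tar archive in memory, standing in for an image layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="app/hello.txt")
        info.size = len(content)
        info.mtime = mtime  # stored in the tar header, not in the file's bytes
        tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def layer_digest(layer):
    return "sha256:" + hashlib.sha256(layer).hexdigest()


content = b"hello\n"
built_first = make_layer(content, mtime=1_600_000_000)
built_later = make_layer(content, mtime=1_600_086_400)  # one day later

same_file_different_digest = layer_digest(built_first) != layer_digest(built_later)
```

The file contents are identical in both archives. The digest changes purely because of the mtime in the tar header.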
Also note that you can still end up with non-reproducibility in your files, depending on how they were generated in the Dockerfile. Docker can't make "curl | bash" reproducible automatically for you!
Docker does use some clever caching tricks to make your builds seem reproducible. If nothing has changed for a step, docker skips it the next time you build and reuses the cached layer. That means you get all the old timestamps, and the same hash, again. Clearing the cache will give you a new hash.
You can try this out with a simple one-liner or any Dockerfile you have: https://gist.github.com/dlorenc/247222f619d78d070d574c7f1fd7d688
To summarize: docker can make it easier to build your stuff reproducibly, but docker build itself is not reproducible. Other tools like jib, bazel, kaniko and ko can do reproducible container builds.
Maybe buildpacks can too? @jonjonsonjr @mattomata @ImJasonH
The google/go-containerregistry library has a function to help strip out timestamps too, here: https://github.com/google/go-containerregistry/blob/5e45177e606652caa75af24f2ed4edee7feefb05/pkg/v1/mutate/README.md#L28
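For a feel of what stripping timestamps involves, here's a rough Python sketch of the same idea applied to a single tarball. This is NOT the go-containerregistry API, just an illustration with a made-up strip_timestamps helper: it rewrites every entry with one fixed mtime and leaves everything else alone.

```python
import io
import tarfile


def strip_timestamps(layer, epoch=0):
    """Rewrite a tar archive so every entry carries the same fixed mtime."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            member.mtime = epoch
            # Extended (pax) headers can carry their own copies of timestamps.
            for key in ("mtime", "atime", "ctime"):
                member.pax_headers.pop(key, None)
            # Only regular files have a data stream to copy across.
            data = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, data)
    return out.getvalue()
```

Two tarballs that differ only in file timestamps should come out byte-for-byte identical after this pass. A real image also needs its config "created" field pinned, which this sketch doesn't touch.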
Also see this great blog post from @taviso on why you might not need reproducible builds: https://blog.cmpxchg8b.com/2020/07/you-dont-need-reproducible-builds.html
I'll just add that you might still want them for reasons other than security, but all of his points are excellent.