• tetris11@lemmy.mlOP
    1 year ago

    Good question. Essentially it’s about reproducibility and “bootstrapping” (read: building yourself into existence).

    Reproducible Binaries

    Let’s say you use VLC media player. VLC is written in plain-text C, but it is compiled into binary machine code for your CPU, and your CPU executes it.

    When you install VLC, you technically can build it from source, but typically you download just the pre-built binary from a server and hope that the server is trusted. On an untrusted server, there is nothing stopping someone from inserting malicious code, and serving it up to you as a “VLC” binary.

    In a better world, you could run a checksum or hash on that binary and compare it against the hash of the same binary from another build server. If the hashes match, the two binaries are identical, so they are either both bad or both good. (Repeat with other build servers until you are sure.)
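
    A minimal sketch of that comparison, assuming the "downloads" are just byte strings made up for illustration (the server names and contents here are hypothetical, not real VLC builds):

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def servers_agree(binaries: dict) -> bool:
    """True when every server served byte-identical content."""
    return len({sha256_hex(blob) for blob in binaries.values()}) == 1


def odd_ones_out(binaries: dict) -> list:
    """Names of servers whose digest differs from the most common digest."""
    digests = {name: sha256_hex(blob) for name, blob in binaries.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority)


# Hypothetical stand-ins for the same "VLC" binary fetched from three servers.
downloads = {
    "server-a": b"\x7fELF...vlc",
    "server-b": b"\x7fELF...vlc",
    "server-c": b"\x7fELF...vlc+something-extra",
}

print(servers_agree(downloads))  # False
print(odd_ones_out(downloads))   # ['server-c']
```

    Note that a majority vote only tells you which server disagrees, not which side is the honest one.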

    However, not all binaries are reproducible. During the build process, small variations creep into the binary machine code, such as timestamps of when it was built. This means that a VLC binary built by person A and a VLC binary built by person B, from the same source, at the same version, and for the same CPU, will not be byte-for-byte identical. The hashes won’t match.

    To get around this, one needs to make builds reproducible by stripping out anything that varies between builds, so that the binaries can be compared.
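
    To illustrate why an embedded timestamp breaks hash comparison, here is a toy sketch. The toy_build function is a made-up stand-in for a compiler (it just glues a timestamp onto the source), and the timestamps are arbitrary. Real toolchains deal with this through conventions such as the SOURCE_DATE_EPOCH environment variable, which this sketch only imitates by pinning the time to a fixed value:

```python
import hashlib


def toy_build(source: str, build_time: int) -> bytes:
    """Hypothetical 'compiler': output = embedded build timestamp + source bytes."""
    header = f"built-at:{build_time}\n".encode()
    return header + source.encode()


def digest(binary: bytes) -> str:
    """SHA-256 of a 'binary' as a hex string."""
    return hashlib.sha256(binary).hexdigest()


SOURCE = "int main(void) { return 0; }"

# Person A and person B build the same source at different moments.
binary_a = toy_build(SOURCE, build_time=1_700_000_000)
binary_b = toy_build(SOURCE, build_time=1_700_000_123)
print(digest(binary_a) == digest(binary_b))  # False

# Pinning the varying input (here, the timestamp) makes the outputs identical.
FIXED_EPOCH = 0
repro_a = toy_build(SOURCE, build_time=FIXED_EPOCH)
repro_b = toy_build(SOURCE, build_time=FIXED_EPOCH)
print(digest(repro_a) == digest(repro_b))  # True
```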

    Bootstrapping

    Once you have reproducible binaries for VLC, you can then trust that VLC is what it says it is. No one has inserted malicious code into your binary, because your binary matches the hashes of other randomly selected VLC binaries.

    The question then is, how do I know that the program that built VLC doesn’t have malicious code?

    VLC is built with GCC (I think?), so one then needs to inspect the source of GCC, build GCC (assuming the code is good), and then use GCC to build VLC. That way you can be sure that your VLC is legit: no bad code in the VLC source, and no bad code in the GCC compiler. Assuming that your GCC build is reproducible, you can compare its hash with other pre-built GCCs to verify that.

    Fun side diversion

    Now we enter a loop. How did we build GCC there? Well, GCC was essentially built by itself: an older version was used to compile the newer version. This goes all the way back to GCC’s first releases in the late 1980s, with each older version of GCC being used to build the next. The burden of trust that no malicious code has been inserted at any point in this chain goes all the way back to the start. See the linked “trusting trust” paper.

    Back to bootstrapping

    Okay, so to recap: If I want to install VLC on my machine such that I can fully trust that no malicious code has been inserted, I need to either:

    • Have a reproducible build whose hash I can compare with those from other trusted build servers. If the hashes match, then all of those servers would have to be compromised in the same way for the binary to be untrustworthy.

    • Build VLC from source myself using a trusted compiler (GCC).

    When you build a Linux system from scratch, this trusted compiler is what you start from. It’s a large 15MB blob of binary code that you can use to build all other software, and you just hope that it’s clean.

    The Guix people thought, “well, what if I want to know that it’s clean?” One solution is to dig up all the GCCs of the past and use them to build each other up to the present, but that would take time and energy, and it would be a large waste of everyone’s resources.

    Instead, they introduced the concept of “seeds”, where one trusted binary can be built from other, smaller trusted binaries, which can in turn be built from tiny trusted binaries, and so on, until you get down to a single executable “seed” file that is human-readable, so anyone can verify that the seed is the same across different systems.

    This seed file, when executed (it’s technically an ELF binary, but its few opcodes have human-readable comments), will build the first set of tiny binaries, which in turn will build the second set of small binaries, which will finally build you a trusted GCC whose hash matches that of every other GCC built this way.
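
    The chain can be sketched as a toy model. Everything here is hypothetical: the stage names are invented, and toy_build simply hashes the builder together with the source, which is a loud simplification (a real build depends on what the builder does, not on its raw bytes). The point is only that a deterministic chain started from the same seed ends at the same hash on every machine:

```python
import hashlib


def toy_build(builder: bytes, source: str) -> bytes:
    """Deterministic stand-in for 'run this builder on this source'."""
    return hashlib.sha256(builder + source.encode()).digest()


# Hypothetical stage sources, from tiniest tools up to the final compiler.
STAGES = ["tiny-tools source", "small-tools source", "gcc source"]


def bootstrap(seed: bytes) -> str:
    """Build each stage with the output of the previous one, starting at the seed."""
    current = seed
    for source in STAGES:
        current = toy_build(current, source)
    return current.hex()


SEED = b"hypothetical human-auditable seed"

machine_1 = bootstrap(SEED)
machine_2 = bootstrap(SEED)
tampered = bootstrap(SEED + b" plus a hidden change")

print(machine_1 == machine_2)  # True: same seed, same final hash
print(machine_1 == tampered)   # False: a different seed shows up at the end
```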

    Now I’ve only been talking about VLC and GCC, but truthfully an entire OS needs more than that, such as bash, coreutils, binutils, and other binaries. These are known as the “bootstrap binaries”.

    Reduction in Seed size

    A reproducible Guix system initially would be seeded with a large 250MiB file of all the bootstrap binaries needed to build the entire OS.

    In 2019, the Guix people managed to replace some of these bootstrap binaries with a few smaller ones (using GNU Mes), making the initial deployment of Guix a 130MiB file.

    In 2020, they managed to reduce this further to a few small binaries, making the initial deployment of Guix a 60MiB file.

    In 2023, they managed to get it down to a 357-byte file (plus one 25MB Guile library).
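
    To put those numbers side by side, here is a quick back-of-the-envelope calculation. It assumes MiB means 1024² bytes and that the “25MB” figure means 25 × 1000² bytes, which is a guess about the units used above:

```python
MIB = 1024 ** 2  # bytes in one MiB
MB = 1000 ** 2   # assumption: "25MB" is a decimal megabyte figure

SEED_BYTES = 357  # the 2023 seed file on its own

# Size of everything that has to be trusted up front, per milestone.
seed_sizes = {
    "initial": 250 * MIB,
    "2019": 130 * MIB,
    "2020": 60 * MIB,
    "2023": SEED_BYTES + 25 * MB,  # seed plus the Guile library
}


def percent_of_initial(size: int) -> float:
    """Size as a percentage of the original 250MiB seed."""
    return 100 * size / seed_sizes["initial"]


for label, size in seed_sizes.items():
    print(f"{label:>7}: {size:>11,} bytes ({percent_of_initial(size):.2f}% of initial)")

# How many times smaller the 357-byte seed is than the original seed.
print(seed_sizes["initial"] // SEED_BYTES)
```

    Under these assumptions, almost all of the remaining trusted bytes are the Guile library rather than the seed itself.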

    This is an incredible achievement towards a fully bootstrapped system. Right now the burden of trust is:

    • Seed file (albeit specific to your CPU)
    • Guile library

    If they can remove the Guile dependency, then they can reduce this burden of trust even further!

      • tetris11@lemmy.mlOP
        1 year ago

        All good – though I’m not sure what circular dependency could be resolved by using an older build.

        Typically you do guix package --list-available=^mypackage$ and it lists all available versions, similar to conda