@SRAZKVT you can use my fractal zip format for this purpose actually (one of the design goals was representing a dataset with changes over time). that's actually a really fascinating corollary of using it to represent atomic transitions between reproducible filesystem states....................i did not realize that VCS is precisely a formulation of atomic filesystem transactions. need to think about this further
but! there are some cool math tricks you can play around with that were inspired by the mythical quantum computer! and these generally involve lattice theory.
lattice techniques for asymmetric crypto are mostly still very new and experimental, and haven't yet reached the maturity of elliptic curve techniques. in particular, they tend to be incredibly slow, with massive key sizes. this reduces security!
the most practical lattice approach atm is kyber. and it's pretty neat!
one interesting way you can play around with lattice methods is to perform a key agreement process that merges the security of ECC (elliptic curve crypto) and lattices. signal did this for a while: https://signal.org/blog/pqxdh/
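here's a rough sketch of what "merging" the two can look like, assuming you already have one shared secret from an elliptic curve exchange and one from a lattice KEM. both secrets go through a single KDF, so an attacker has to recover both to get the session key. the function names, the zero salt, and the info label are all made up for illustration, and this is not signal's actual PQXDH construction (read their spec for that):

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256, built from the stdlib hmac module."""
    # Extract: condense the input keying material into a pseudorandom key.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: stretch the pseudorandom key to the requested length.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_secret(ecc_secret: bytes, lattice_secret: bytes) -> bytes:
    """Derive one session key from two independently agreed secrets.

    Hypothetical helper: both inputs are assumed to be 32-byte shared
    secrets that were already computed elsewhere.
    """
    if len(ecc_secret) != 32 or len(lattice_secret) != 32:
        # Fixed lengths keep the concatenation below unambiguous.
        raise ValueError("this sketch expects two 32-byte secrets")
    return hkdf_sha256(
        ecc_secret + lattice_secret,
        salt=b"\x00" * 32,          # assumption: fixed all-zero salt
        info=b"hybrid-kdf-sketch",  # assumption: made-up context label
    )
```

changing either input changes the derived key, which is the whole point: the result is only as weak as the stronger of the two exchanges.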
unfortunately, signal has more recently taken on a new cryptographer who keeps saying really odd shit. and he decided to move away from double ratchet and go all-in on lattice methods: https://eprint.iacr.org/2025/078.pdf.
if you scroll to page fucking SIXTY you'll see him admit that moving away from the elliptic curve crypto of the double ratchet protocol to FULL QUANTUM actually lost protection against adversarial randomness, in order to be safe from the quantum boogeyman
in fact, double ratchet was actually the first protocol to demonstrate resistance against adversarial randomness. you should read this 2020 paper on ratcheting!
this paper actually introduced adversarial randomness as a concept for the first time, discovering it as a specific property of the double ratchet cryptosystem.
isn't it cool how we can discover new theory from applied math? i love that type of shit
anyway, this matters because DUAL_EC_DRBG was a previous attempt by the NSA to backdoor a standardized random number generator. and since the OS mediates access to your hardware, the OS also mediates randomness.
so, in short, you also have to trust your OS to generate random bits correctly, although some specific cryptographic techniques like double ratchet can be more resilient in specific ways.
and this brings us back to trusting the OS!
asymmetric cryptography also covers signatures. these can be defined in many distinct ways, because the kind of thing you want a signature to achieve may be domain-specific or situational. but in general, a signature is a way to use secret data (from asymmetric crypto) to authenticate the result of a separate computation.
one of the standard ways to do this is to use the secret data (private key) to encrypt the result of your computation, in a way that can be inverted with the corresponding public key. this constitutes a form of "cryptographic proof".
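as a toy illustration of that "apply the private key, invert with the public key" shape, here's textbook RSA with comically small numbers. everything here is an assumption for demonstration: the key (n=3233, e=17, d=2753) is the classic classroom example, there's no padding, and reducing the hash mod n throws away almost all of its strength. never use this for anything real:

```python
import hashlib

# Toy textbook-RSA key built from p=61 and q=53. Insecure on purpose.
N = 3233   # public modulus
E = 17     # public exponent
D = 2753   # private exponent, since (17 * 2753) % 3120 == 1


def digest_mod_n(message: bytes) -> int:
    """Hash the 'result of the computation' and squeeze it into Z_N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    # Private-key operation: only the holder of D can produce this value.
    return pow(digest_mod_n(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    # Public-key operation: anyone can invert the signature and compare.
    return pow(signature, E, N) == digest_mod_n(message)
```

a verifier who trusts the public key learns exactly one thing: somebody holding the private exponent signed this digest. who that somebody is remains a human question.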
what does an asymmetric keypair "prove"?
keep in mind that diffie-hellman (DH) secrets are an example of asymmetric crypto, but don't introduce the standard "keypair" framework. DH secrets can also be used as a form of "proof", but not the same way as a keypair!
this sounds pedantic, but it's important to be thoughtful about words like "proof" (which have human-level meanings), and whether/how the mathematical guarantees actually translate into human-meaningful assurances.
for example, if a private key is shared among a group of users, the math is no less powerful, but the human interpretation of the mathematical result is different in a specific way!
in particular, you can't distinguish between members of the group (less powerful guarantee), but members of the group also gain a degree of anonymity when using the shared key!
effective cryptographic engineering is therefore a deeply sociological pursuit. the mathematical guarantees are unambiguous, but their human interpretation is not!
so i'm now going to simplify all of this into the context of linux kernel module signing. recall that the stated goal was "reproducibility". let's see how we can translate that human assurance into mathematical guarantees.
we'll narrow signature schemes to this use case:
- the computation we want to "prove" is representable by serializing the state of the kernel build system directory tree.
- a filesystem tree can be unambiguously serialized by walking in order from the root, iterating directory entries in byte-lexicographic order, then writing each entry's name, mode bits, and any file data (a symlink is represented with the target as "file data").
in fact, this ends up being very similar to how tar or zip archives work, because these were intended as generic serialization formats for filesystem trees!
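the walk described above can be sketched roughly like this. the names (`tree_checksum`, `_walk`, `_field`) are hypothetical, and the choice of sha256 plus length-prefixed fields is my own assumption to keep the byte stream unambiguous. it also does not take any kind of write lock, so it only means something if nobody touches the tree while it runs:

```python
import hashlib
import os
import stat
import struct


def _field(data: bytes) -> bytes:
    # Length-prefix each field so adjacent fields can't bleed into each other.
    return struct.pack(">Q", len(data)) + data


def _walk(path: bytes, h) -> None:
    # Sorting bytes names gives byte-lexicographic order.
    names = sorted(os.listdir(path))
    h.update(struct.pack(">Q", len(names)))
    for name in names:
        child = os.path.join(path, name)
        st = os.lstat(child)  # lstat: describe symlinks, don't follow them
        h.update(_field(name))
        h.update(struct.pack(">I", st.st_mode))
        if stat.S_ISLNK(st.st_mode):
            # A symlink's "file data" is its target.
            h.update(_field(os.readlink(child)))
        elif stat.S_ISDIR(st.st_mode):
            _walk(child, h)
        elif stat.S_ISREG(st.st_mode):
            with open(child, "rb") as f:
                h.update(_field(f.read()))
        # Other file types contribute only their name and mode bits.


def tree_checksum(root: str) -> str:
    """Checksum of everything below root: names, mode bits, and file data."""
    h = hashlib.sha256()
    _walk(os.fsencode(root), h)
    return h.hexdigest()
```

note what is deliberately left out: timestamps, uid/gid, and the order in which files were created. two trees with the same names, modes, and contents get the same checksum.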
making these unambiguous invokes a whole separate discussion around what a checksum ends up "proving". but i don't have time right now to get into tree hashing of zip archives or hash collision attacks on .tar.zst files--for now, we assume:
- checksums don't collide,
- the filesystem can be globally write-locked while recursively iterating,
- we can "stop" and "restart" the kernel build process at arbitrary points
(in fact, these are all things the linux operating system kernel should be able to let us do with a filesystem tree, but that's yet another topic for a separate time)
so: we can take a filesystem tree like the kernel build directory, and we can convert that to a checksum. the "reproducible builds" organization (despite lack of OS-level support for atomicity) defines "reproducibility" in these terms:
- if the kernel build directory starts at checksum C_0,
- and after the build process, the kernel build directory matches the expected checksum C_1,
then the build process is "reproducible". it produces C_1 from C_0 (after executing some vaguely-defined process or processes). this is repeatable if there is sufficient information available to produce C_1 from C_0
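that definition is small enough to write down directly. in this sketch a "tree" is just an in-memory dict from file name to contents, and `checksum`, `is_reproducible`, and the two example builds are all hypothetical stand-ins, not anything the reproducible builds project ships:

```python
import hashlib
import os
from typing import Callable, Dict

Tree = Dict[str, bytes]  # toy model: file name -> file contents


def checksum(tree: Tree) -> str:
    h = hashlib.sha256()
    for name in sorted(tree):
        for part in (name.encode(), tree[name]):
            h.update(len(part).to_bytes(8, "big"))  # length-prefix each field
            h.update(part)
    return h.hexdigest()


def is_reproducible(start: Tree, build: Callable[[Tree], Tree], c0: str, c1: str) -> bool:
    """True if the tree starts at C_0 and the build lands on C_1."""
    if checksum(start) != c0:
        return False
    return checksum(build(dict(start))) == c1


def deterministic_build(tree: Tree) -> Tree:
    # The output depends only on the input tree.
    tree["out.bin"] = tree["main.c"].upper()
    return tree


def leaky_build(tree: Tree) -> Tree:
    # The output also depends on something outside the tree (random bytes
    # here, standing in for a timestamp, a uid, a hostname...).
    tree["out.bin"] = tree["main.c"].upper() + os.urandom(16)
    return tree
```

`leaky_build` fails the check not because the build is wrong, but because C_0 does not contain all the information the build consumed.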
i say "if there is sufficient information", but the reproducible builds evangelism strike force team doesn't think that way. their modus operandi is to repeatedly neg maintainers to make extremely confusing and subtle modifications to their release process for all their users.
it would only be necessary to change the build process for all users if the reproducible builds squadron believed very very deeply that the maintainer's build output is the ground truth for everyone else to reproduce.
.........which brings us to the issue at hand for the kernel.
for a representative example of how the reproducible builds evangelism strike force approaches maintainers, consider this post from an arch linux package maintainer to the bug report mailing list for gnu automake: https://lists.gnu.org/archive/html/bug-automake/2025-08/msg00000.html
In Arch Linux our automake package includes
/usr/share/doc/automake/amhello-1.0.tar.gz. When we rebuild this package using our rebuilder to check for reproduciblity the uid/gid and timestamps are not normalized
- the arch linux package build system orchestrates the build process,
- the arch linux automake package decides to include extraneous test data in the output,
- the arch linux "reproducibility" checker does not automatically zero out fields that are known to induce non-matching checksums,
.........so arch linux files a "bug" against automake.
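for a sense of what "zeroing out the known-noisy fields" could look like on the checker's side, here's a minimal sketch using python's stdlib `tarfile`. `normalize_tar` is a hypothetical helper, not anything from arch's actual rebuilder, and it emits an uncompressed tar on purpose (a gzip wrapper carries its own timestamp in its header, which would need the same treatment):

```python
import io
import tarfile


def normalize_tar(data: bytes) -> bytes:
    """Rewrite a tar archive with owner and timestamp metadata zeroed out."""
    out = io.BytesIO()
    # "r:*" lets tarfile detect compression on the input by itself.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as src, \
         tarfile.open(fileobj=out, mode="w", format=tarfile.USTAR_FORMAT) as dst:
        # Sort members so archive order doesn't affect the result either.
        for member in sorted(src.getmembers(), key=lambda m: m.name):
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            member.mtime = 0
            member.pax_headers = {}  # drop extended headers that may repeat these fields
            payload = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, payload)
    return out.getvalue()
```

two archives that differ only in uid/gid and timestamps normalize to identical bytes, so they also checksum identically. (plain ustar caps member names at roughly 100 characters plus a prefix, which is fine for a sketch.)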
to remove all doubt, another arch linux maintainer follows up: https://lists.gnu.org/archive/html/bug-automake/2025-11/msg00007.html
You don't need to worry about the value, this variable is meant to be set externally. From the reproducible-builds.org documentation, this is suggested for shell scripts on GNU systems:
(note the username kpcyrd here. he'll be coming up again soon.)
this is a very specific set of build process requirements specific to the arch linux packaging system, which our friendly neighborhood distro maintainer is able to specify with precise detail.
and this is filed as a bug upstream, because the reproducible builds evangelism strike force requires "reproducibility" in the form of a code injection API to achieve a chosen-plaintext attack.