@SRAZKVT you can use my fractal zip format for this purpose actually (one of the design goals was representing a dataset with changes over time). that's actually a really fascinating corollary of using it to represent atomic transitions between reproducible filesystem states....................i did not realize that VCS is precisely a formulation of atomic filesystem transactions. need to think about this further
@SRAZKVT i was probably so so close to this exact epiphany when i tried to prototype sharing the twitter git FUSE object db with the pants object store. i couldn't understand what the hell a commit was and ended up making this i/o virtualization framework instead https://github.com/cosmicexplorer/upc/blob/master/local/memory/LibMemory.scala
@SRAZKVT thank you aaaaaaaaaaa
@SRAZKVT one thing i've noticed is that OS research has this idea of a "checkpoint" like a save state but doesn't have any concept of like a delta between states. checkpoints are just reset points. filesystems have snapshots over time like zfs but they're not like a way to replay the result of a process execution, they just use generic tree diffing algorithms and consider it an optimization. there's no idea of reproducible semantics
@SRAZKVT this also bleeds into "reproducible builds" which negs maintainers to make highly questionable changes to their build processes so the checksum matches but just whimpers and cries and calls it impossible to support secret data like kernel module signatures
@SRAZKVT i like basically everyone who works on guix and i think they'd be open to this kind of argument
@SRAZKVT interestingly, this represents a novel conception of "swappability" which is my little neologism for "reproducibility that transfers across dependency graphs". it's something that i argue spack provides with its logic graph that lets you swap out package dependencies and expect things to still work--i didn't realize it could also apply to filesystem state transitions like this. so that's another argument for the excessive narrowness of "reproducibility"--guix and nix use checkpoint/checksum reproducibility to conflate both package build reproducibility and filesystem i/o state reproducibility
@SRAZKVT one really good point you made a while back that stuck with me was observing how odd it was that git calculates diffs on the fly all the time (and subsequently recognizing the confusion between printing a "commit" as a delta, but using a "commit" checksum to represent a discrete filesystem state). if the commit is signed, is that signature for the delta or for the resulting filesystem state? and iirc we agreed that a contributor would sign a delta, but a maintainer would sign a state (e.g. after merging a PR to main)
@SRAZKVT and interestingly enough, tree hashing is a way to share an intermediate checksum computation for an individual element of a sequence in a way that enables reordering the sequence and efficiently regenerating a checksum for the whole sequence (SHA* can't do this, BLAKE* can). so that actually distinguishes the checksum computed for a sequence of commit deltas from the checksum for the resulting filesystem tree. and that's interesting!
i think you could actually directly solve the literally fake and made up conundrum of "reproducible" kernel module signing this way. to wit: https://lwn.net/Articles/1012946/
Reproducible builds are important for security, since they allow independent verification that an open-source project has been compiled without inserting extra, malicious changes.
ok, but "independent verification" depends on some pretty subtle cryptography that most people absolutely do not understand. i am going to teach it to you now.
(cryptography cannot verify whether a change is "malicious", btw. but it can give you a pretty strong hint, as we will see.)
For something as foundational as the Linux kernel, it would be nice to be able to verify that a build of the kernel is unmodified.
unmodified compared to...? we will find this is in fact the crucial distinction. and cryptography can answer this!
But when signing keys are needed for the build, it cannot be made reproducible without distributing the key.
so, not technically "wrong", but it fails to interrogate what a "signature" means. here's where we get into symmetric vs asymmetric crypto
in short, symmetric crypto fucks up your data so it can't be recovered. but it does this deterministically--i.e. REPRODUCIBLY!
"reproduce" is a pretty straightforward portmanteau. so what is being produced, and what is being repeated?
checksums ("hashes") are symmetric crypto that squeezes your message (ordered string of bits) into a smaller space (fewer bits). this means you can have collisions, and you can't "decrypt" it to get back the input data. examples:
- the NIST SHA algorithms (merkle-damgård hashing),
- the BLAKE* family of algorithms (tree hashing).
encryption ("ciphers") are the more classic example of symmetric crypto, and where it got the name: the input data ("plaintext") can be losslessly regenerated from the ciphertext--but only if sender and receiver both use the same key (a string of bytes). examples include:
- the nazi enigma (this was not cracked by turing, but by the polish bomba, a mechanical computer. turing did indeed break nazi ciphers, but the polish did the hard part before churchill took credit.)
- AES (good, resistant against imaginary quantum computers)
wikipedia doesn't have any article on tree hashing, which can be highly parallelized, cached, and made resilient against length extension attacks. you should read the BLAKE3 paper on this topic.
every NIST SHA algorithm has been a Merkle–Damgård type, which is both slower and less secure. BLAKE3 lost out to keccak in the NIST secure hashing competition. NIST does not provide rationale for its decisions.
asymmetric cryptography was introduced in 1976 by whitfield diffie and martin hellman through the appropriately-titled paper new directions in cryptography.
the fundamental mathematical trick for "asymmetry" relies much less upon information theoretic metrics of distinguishability or differentiability like symmetric crypto. instead they rely upon computational hardness assumptions. we haven't been able to prove there isn't a fast algorithm for these computationally "hard" problems! this makes them exciting
this is a subset of "P vs NP". examples of computational hardness assumptions used for asymmetric crypto:
- prime factorization: used for RSA, the first full cryptosystem featuring public and private keys, or a "keypair".
- unfortunately, this tends to require much larger keys than more modern techniques. and the company RSA was the one corporation pushing DUAL_EC_DRBG, the NSA crypto backdoor
i prefer
- discrete log: this is what diffie-hellman key exchange uses, and it's so simple it feels like a magic trick.
- a version of this is the basis for elliptic curve cryptography, which is everyone's favorite because the keys can be much much shorter (256 bits provides equivalent-or-better strength than 4096 bits for RSA).
curve25519 is what i recommend using for asymmetric crypto in general. it's a standard and there are no parameters to choose between
merkle referred to these computational tricks as "trapdoor" problems, because they can be efficiently computed in one direction, but not the other way.
you might recall this mechanic being similar to the goal of a hash function--however, hash functions (checksums) aren't counting upon the non-invertibility in the same way (recall information theoretic vs complexity theoretic hardness). non-invertibility of a hash function is described as pre-image resistance, and it's one of multiple qualities for a hash function to satisfy.
computational hardness assumptions for asymmetric crypto are theoretically susceptible to the mythical quantum computer. but physicists have never conclusively proven that the mathematical model they made up for quantum computing is actually how quantum physics works.
that's the thing about math: it's not real, and you can just make shit up and have a great time doing math. but cryptography concerns itself with applied math, i.e. math implemented using physics. and while elliptic curves and finite-field arithmetic can be implemented in physics with semiconductors, quantum computing has never been demonstrated to exist in any form.
this is why i don't like physicists.
but! there are some cool math tricks you can play around with that were inspired by the mythical quantum computer! and these generally involve lattice theory.
lattice techniques for asymmetric crypto are mostly still very new and experimental, and haven't yet reached the maturity of elliptic curve techniques. in particular, they tend to be incredibly slow, with massive key sizes. this reduces security!
the most practical lattice approach atm is kyber. and it's pretty neat!
one interesting way you can play around with lattice methods is to perform a key agreement process that merges the security of ECC (elliptic curve crypto) and lattices. signal did this for a while: https://signal.org/blog/pqxdh/
unfortunately, signal has more recently taken on a new cryptographer who keeps saying really odd shit. and he decided to move away from double ratchet and go all-in on lattice methods: https://eprint.iacr.org/2025/078.pdf.
if you scroll to page fucking SIXTY you'll see him admit that moving away from the elliptic curve crypto of the double ratchet protocol to FULL QUANTUM actually lost protection against adversarial randomness, in order to be safe from the quantum boogeyman
in fact, double ratchet was actually the first protocol to demonstrate resistance against adversarial randomness. you should read this 2020 paper on ratcheting!
in fact, this paper actually introduced adversarial randomness for the first time, discovering it as a specific property of the double ratchet cryptosystem.
isn't it cool how we can discover new theory from applied math? i love that type of shit