Post · bonfire.cafe

@hipsterelectron@circumstances.run last month

@SRAZKVT you can use my fractal zip format for this purpose actually (one of the design goals was representing a dataset with changes over time). that's actually a really fascinating corollary of using it to represent atomic transitions between reproducible filesystem states....................i did not realize that VCS is precisely a formulation of atomic filesystem transactions. need to think about this further

@hipsterelectron@circumstances.run last month

@SRAZKVT i was probably so so close to this exact epiphany when i tried to prototype sharing the twitter git FUSE object db with the pants object store. i couldn't understand what the hell a commit was and ended up making this i/o virtualization framework instead https://github.com/cosmicexplorer/upc/blob/master/local/memory/LibMemory.scala

@hipsterelectron@circumstances.run last month

@SRAZKVT thank you aaaaaaaaaaa

@hipsterelectron@circumstances.run last month

@SRAZKVT one thing i've noticed is that OS research has this idea of a "checkpoint" like a save state but doesn't have any concept of like a delta between states. checkpoints are just reset points. filesystems have snapshots over time like zfs but they're not like a way to replay the result of a process execution, they just use generic tree diffing algorithms and consider it an optimization. there's no idea of reproducible semantics

@hipsterelectron@circumstances.run last month

@SRAZKVT this also bleeds into "reproducible builds" which negs maintainers to make highly questionable changes to their build processes so the checksum matches but just whimpers and cries and calls it impossible to support secret data like kernel module signatures

@hipsterelectron@circumstances.run last month

@SRAZKVT i like basically everyone who works on guix and i think they'd be open to this kind of argument

@hipsterelectron@circumstances.run last month

@SRAZKVT interestingly, this represents a novel conception of "swappability" which is my little neologism for "reproducibility that transfers across dependency graphs". it's something that i argue spack provides with its logic graph that lets you swap out package dependencies and expect things to still work--i didn't realize it could also apply to filesystem state transitions like this. so that's another argument for the excessive narrowness of "reproducibility"--guix and nix use checkpoint/checksum reproducibility to conflate both package build reproducibility and filesystem i/o state reproducibility

@hipsterelectron@circumstances.run last month

@SRAZKVT one really good point you made a while back that stuck with me was observing how odd it was that git calculates diffs on the fly all the time (and subsequently recognizing the confusion between printing a "commit" as a delta, but using a "commit" checksum to represent a discrete filesystem state). if the commit is signed, is that signature for the delta or for the resulting filesystem state? and iirc we agreed that a contributor would sign a delta, but a maintainer would sign a state (e.g. after merging a PR to main)
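
a minimal sketch of that delta-vs-state distinction, assuming a filesystem state can be modeled as a dict from path to contents. the names `diff`, `apply`, and `checksum` are hypothetical and this is not how git stores anything; it only shows that "the delta" and "the resulting state" are different things to checksum (and therefore to sign).

```python
import hashlib
import json


def checksum(obj) -> str:
    # Canonical JSON (sorted keys) so equal objects always hash identically.
    encoded = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def diff(old: dict, new: dict) -> dict:
    """Delta that turns `old` into `new`: paths to set and paths to delete."""
    return {
        "set": {p: c for p, c in new.items() if old.get(p) != c},
        "delete": sorted(p for p in old if p not in new),
    }


def apply(state: dict, delta: dict) -> dict:
    """Apply a delta to a state, returning the new state."""
    result = {p: c for p, c in state.items() if p not in delta["delete"]}
    result.update(delta["set"])
    return result


main = {"README": "hello", "src/a.c": "int a;"}
pr_branch = {"README": "hello", "src/a.c": "int a = 1;", "src/b.c": "int b;"}

delta = diff(main, pr_branch)
contributor_signs = checksum(delta)             # checksum of the change
maintainer_signs = checksum(apply(main, delta))  # checksum of the result

# The same delta on a different base gives a different resulting state.
other_base = {"LICENSE": "MIT"}
other_result = checksum(apply(other_base, delta))
```

the delta checksum stays the same no matter which base it lands on, while the state checksum depends on both, which is one way to read "a contributor signs a delta, a maintainer signs a state".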

@hipsterelectron@circumstances.run last month

@SRAZKVT and interestingly enough, tree hashing is a way to share an intermediate checksum computation for an individual element of a sequence in a way that enables reordering the sequence and efficiently regenerating a checksum for the whole sequence (the standard SHA-* functions can't do this; BLAKE3 can, and BLAKE2 offers an optional tree mode). so that actually distinguishes the checksum computed for a sequence of commit deltas from the checksum for the resulting filesystem tree. and that's interesting!
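
a rough sketch of the caching idea, assuming a generic merkle tree built on `hashlib.blake2b`. this is not BLAKE3's actual tree mode (the prefixes, the pairing rule, and the function names are all made up for illustration); it only shows that per-element digests can be computed once and then recombined in a new order without re-reading the elements.

```python
import hashlib


def leaf_digest(chunk: bytes) -> bytes:
    # One-byte prefix separates leaves from interior nodes.
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def parent_digest(left: bytes, right: bytes) -> bytes:
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def root_digest(leaves: list[bytes]) -> bytes:
    """Combine already-computed leaf digests pairwise up to a single root."""
    if not leaves:
        return leaf_digest(b"")
    level = list(leaves)
    while len(level) > 1:
        nxt = [parent_digest(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd node is promoted unchanged
        level = nxt
    return level[0]


deltas = [b"delta-a", b"delta-b", b"delta-c"]
cached = [leaf_digest(d) for d in deltas]  # the expensive part, done once

original = root_digest(cached)
# Reordering only reshuffles cached 32-byte digests; no delta is rehashed.
reordered = root_digest([cached[2], cached[0], cached[1]])
```

with a merkle-damgård hash over the concatenated sequence, the reordered checksum would need every byte of every element fed through again.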

@hipsterelectron@circumstances.run last month

i think you could actually directly solve the literally fake and made up conundrum of "reproducible" kernel module signing this way. to wit: https://lwn.net/Articles/1012946/

"Reproducible builds are important for security, since they allow independent verification that an open-source project has been compiled without inserting extra, malicious changes."

ok, but "independent verification" depends on some pretty subtle cryptography that most people absolutely do not understand. i am going to teach it to you now.

@hipsterelectron@circumstances.run last month

(cryptography cannot verify whether a change is "malicious", btw. but it can give you a pretty strong hint, as we will see.)

@hipsterelectron@circumstances.run last month

"For something as foundational as the Linux kernel, it would be nice to be able to verify that a build of the kernel is unmodified."

unmodified compared to...? we will find this is in fact the crucial distinction. and cryptography can answer this!

@hipsterelectron@circumstances.run last month

"But when signing keys are needed for the build, it cannot be made reproducible without distributing the key."

so, not technically "wrong", but it fails to interrogate what a "signature" means. here's where we get into symmetric vs asymmetric crypto

@hipsterelectron@circumstances.run last month

in short, symmetric crypto fucks up your data so it can't be recovered. but it does this deterministically--i.e. REPRODUCIBLY!

"reproduce" is a pretty straightforward compound: "re" + "produce". so what is being produced, and what is being repeated?

@hipsterelectron@circumstances.run last month

checksums ("hashes") are symmetric crypto that squeezes your message (ordered string of bits) into a smaller space (fewer bits). this means you can have collisions, and you can't "decrypt" it to get back the input data. examples:

the NIST SHA-1 and SHA-2 algorithms (merkle-damgård hashing; SHA-3 is a sponge construction instead),
BLAKE3 (tree hashing; the earlier BLAKE and BLAKE2 only do tree hashing as an optional mode).
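
a small illustration of the "deterministic squeeze" using python's standard `hashlib` (BLAKE3 is not in the standard library, so `blake2b` stands in for the BLAKE family here; the variable names are just for this example).

```python
import hashlib

message = b"the same bits in the same order"

# Deterministic: hashing the same input twice reproduces the same checksum.
sha_once = hashlib.sha256(message).hexdigest()
sha_again = hashlib.sha256(message).hexdigest()

# Squeezing: a million input bytes still come out as a 32-byte digest,
# so distinct inputs must collide somewhere and nothing can be "decrypted".
big_digest = hashlib.sha256(b"x" * 1_000_000).digest()
blake_digest = hashlib.blake2b(message, digest_size=32).digest()

# Flipping a single input bit gives an unrelated-looking checksum.
flipped = bytes([message[0] ^ 1]) + message[1:]
sha_flipped = hashlib.sha256(flipped).hexdigest()

print(sha_once == sha_again, len(big_digest), len(blake_digest))
```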

@hipsterelectron@circumstances.run last month

encryption ("ciphers") are the more classic example of symmetric crypto, and where it got the name: the input data ("plaintext") can be losslessly regenerated from the ciphertext--but only if sender and receiver both use the same key (a string of bytes). examples include:

the nazi enigma (this was first broken not by turing, but by the polish cipher bureau, whose bomba was an electromechanical codebreaking machine. turing did indeed break nazi ciphers, but the polish did the hard part before churchill took credit.)
AES (good, resistant against imaginary quantum computers)
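
to make the "same key on both ends" property concrete, here is a toy stream cipher. loud assumption: this is NOT AES and NOT secure, it is a keystream made by hashing `key + nonce + counter` with sha-256 and XORing it into the data, and `keystream`/`xor_cipher` are hypothetical names. it only demonstrates that the plaintext comes back losslessly with the same key and not with a different one.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce, and a running counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big"))
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR with the keystream; the same call encrypts and decrypts."""
    stream = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


key = b"shared secret key"
nonce = b"message-0001"
plaintext = b"attack at dawn, or maybe after coffee"

ciphertext = xor_cipher(key, nonce, plaintext)
recovered = xor_cipher(key, nonce, ciphertext)       # same key: lossless
garbage = xor_cipher(b"wrong key", nonce, ciphertext)  # different key: noise
```

for anything real you'd reach for an authenticated mode of AES from a vetted library rather than anything shaped like this.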

@hipsterelectron@circumstances.run last month

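wikipedia doesn't have a dedicated article on tree hashing (the closest is the one on merkle trees), which can be highly parallelized, cached, and made resilient against length extension attacks. you should read the BLAKE3 paper on this topic.

the NIST SHA-1 and SHA-2 algorithms are Merkle–Damgård constructions, which are harder to parallelize and open to length extension unless the output is truncated. the original BLAKE, the ancestor of BLAKE3, lost out to keccak in the NIST SHA-3 competition; keccak (now SHA-3) is a sponge construction rather than Merkle–Damgård. NIST's rationale for that decision is in its third-round report on the competition.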

@hipsterelectron@circumstances.run last month

asymmetric cryptography was introduced in 1976 by whitfield diffie and martin hellman through the appropriately-titled paper new directions in cryptography.

the fundamental mathematical trick for "asymmetry" relies much less upon information theoretic metrics of distinguishability or differentiability like symmetric crypto. instead they rely upon computational hardness assumptions. we haven't been able to prove there isn't a fast algorithm for these computationally "hard" problems! this makes them exciting

@hipsterelectron@circumstances.run last month

this is closely tied to "P vs NP" (a proof that P = NP would break all of these). examples of computational hardness assumptions used for asymmetric crypto:

prime factorization: used for RSA, the first full cryptosystem featuring public and private keys, or a "keypair".
- unfortunately, this tends to require much larger keys than more modern techniques. and the company RSA was the one corporation pushing DUAL_EC_DRBG, the NSA crypto backdoor

@hipsterelectron@circumstances.run last month

i prefer

discrete log: this is what diffie-hellman key exchange uses, and it's so simple it feels like a magic trick.
- a version of this is the basis for elliptic curve cryptography, which is everyone's favorite because the keys can be much much shorter (a 256-bit curve key provides roughly the strength of a 3072-bit RSA key).

curve25519 is what i recommend using for asymmetric crypto in general. it's a standard and there are no parameters to choose between
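
the magic trick, sketched with plain modular exponentiation. assumptions: the modulus `P` (the mersenne prime 2^127 - 1) and base `G = 3` are toy parameters picked for this example and are far too small for real use; a real deployment would use X25519 from a vetted library instead.

```python
import secrets

P = 2**127 - 1  # toy prime modulus, NOT a safe real-world parameter
G = 3           # toy base


def keygen() -> tuple[int, int]:
    """Pick a random secret exponent and publish G**secret mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


alice_private, alice_public = keygen()
bob_private, bob_public = keygen()

# Each side combines its own secret with the other side's public value.
alice_shared = pow(bob_public, alice_private, P)
bob_shared = pow(alice_public, bob_private, P)

# Both arrive at G**(a*b) mod P without ever sending a or b.
print(alice_shared == bob_shared)
```

an eavesdropper sees only `alice_public` and `bob_public`; recovering either exponent from those is the discrete log problem.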

@hipsterelectron@circumstances.run last month

diffie and hellman referred to these computational tricks as "trap-door" problems, because they can be efficiently computed in one direction, but not the other way.

you might recall this mechanic being similar to the goal of a hash function--however, hash functions (checksums) aren't counting upon the non-invertibility in the same way (recall information theoretic vs complexity theoretic hardness). non-invertibility of a hash function is described as pre-image resistance, and it's one of multiple qualities for a hash function to satisfy.

@hipsterelectron@circumstances.run last month

computational hardness assumptions for asymmetric crypto are theoretically susceptible to the mythical quantum computer. but physicists have never conclusively proven that the mathematical model they made up for quantum computing is actually how quantum physics works.

that's the thing about math: it's not real, and you can just make shit up and have a great time doing math. but cryptography concerns itself with applied math, i.e. math implemented using physics. and while elliptic curves and finite-field arithmetic can be implemented in physics with semiconductors, quantum computing has never been demonstrated to exist in any form.

this is why i don't like physicists.

@hipsterelectron@circumstances.run last month

but! there are some cool math tricks you can play around with that were inspired by the mythical quantum computer! and these generally involve lattice theory.

lattice techniques for asymmetric crypto are mostly still very new and experimental, and haven't yet reached the maturity of elliptic curve techniques. in particular, they tend to be incredibly slow, with massive key sizes. this reduces security!

@hipsterelectron@circumstances.run last month

the most practical lattice approach atm is kyber. and it's pretty neat!

one interesting way you can play around with lattice methods is to perform a key agreement process that merges the security of ECC (elliptic curve crypto) and lattices. signal did this for a while: https://signal.org/blog/pqxdh/

@hipsterelectron@circumstances.run last month

unfortunately, signal has more recently taken on a new cryptographer who keeps saying really odd shit. and he decided to move away from double ratchet and go all-in on lattice methods: https://eprint.iacr.org/2025/078.pdf.

if you scroll to page fucking SIXTY you'll see him admit that moving away from the elliptic curve crypto of the double ratchet protocol to FULL QUANTUM actually lost protection against adversarial randomness, in order to be safe from the quantum boogeyman

@hipsterelectron@circumstances.run last month

in fact, double ratchet was actually the first protocol to demonstrate resistance against adversarial randomness. you should read this 2020 paper on ratcheting!

in fact, this paper actually introduced adversarial randomness for the first time, discovering it as a specific property of the double ratchet cryptosystem.

isn't it cool how we can discover new theory from applied math? i love that type of shit

@hipsterelectron@circumstances.run last month

anyway, the reason this matters is because DUAL_EC_DRBG was a previous attempt by the NSA to break randomness generators. and since the OS mediates access to your hardware, the OS also mediates randomness.

so, in short, you also have to trust your OS to generate random bits correctly, although some specific cryptographic techniques like double ratchet can be more resilient in specific ways.

and this brings us back to trusting the OS!

@hipsterelectron@circumstances.run last month

asymmetric cryptography also covers signatures. these can be defined in many distinct ways, because the kind of thing you want a signature to achieve may be domain-specific or situational. but in general, a signature is a way to use secret data (from asymmetric crypto) to authenticate the result of a separate computation.

one of the standard ways to do this is to use the secret data (private key) to encrypt the result of your computation, in a way that can be inverted with the corresponding public key. this constitutes a form of "cryptographic proof".
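
a textbook sketch of that "encrypt with the private key, invert with the public key" shape, using unpadded RSA over a sha-256 digest. assumptions: the two mersenne primes below are toy values chosen only so the example is self-contained, and unpadded RSA is not secure; real signatures use a padded or purpose-built scheme (RSA-PSS, Ed25519) from a vetted library. `sign` and `verify` are hypothetical names.

```python
import hashlib

PRIME_1 = 2**127 - 1  # toy primes: publicly known, far too small
PRIME_2 = 2**89 - 1
N = PRIME_1 * PRIME_2                  # public modulus
E = 65537                              # public exponent
PHI = (PRIME_1 - 1) * (PRIME_2 - 1)
D = pow(E, -1, PHI)                    # private exponent (Python 3.8+)


def digest_as_int(message: bytes) -> int:
    """The 'result of the computation' being authenticated, reduced mod N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    # Only the holder of D can produce this value.
    return pow(digest_as_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    # Anyone with (N, E) can invert the signature and compare digests.
    return pow(signature, E, N) == digest_as_int(message)


state = b"checksum of the kernel build tree"
signature = sign(state)
valid = verify(state, signature)
tampered = verify(b"checksum of some other tree", signature)
```

note that the thing being signed is a digest, which is exactly why it matters what that digest was computed over.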

what does an asymmetric keypair "prove"?

@hipsterelectron@circumstances.run last month

keep in mind that diffie-hellman (DH) secrets are an example of asymmetric crypto, but don't introduce the standard "keypair" framework. DH secrets can also be used as a form of "proof", but not the same way as a keypair!

this sounds pedantic, but it's important to be thoughtful about words like "proof" (which have human-level meanings), and whether/how the mathematical guarantees actually translate into human-meaningful assurances.

@hipsterelectron@circumstances.run last month

for example, if a private key is shared among a group of users, the math is no less powerful, but the human interpretation of the mathematical result is different in a specific way!

in particular, you can't distinguish between members of the group (less powerful guarantee), but members of the group also gain a degree of anonymity when using the shared key!

effective cryptographic engineering is therefore a deeply sociological pursuit. the mathematical guarantees are unambiguous, but their human interpretation is not!

@hipsterelectron@circumstances.run last month

so i'm now going to simplify all of this into the context of the linux kernel module signing. recall that the stated goal was "reproducibility". let's see how we can translate that human assurance into mathematical guarantees.

we'll narrow signature schemes to this use case:

the computation we want to "prove" is representable by serializing the state of the kernel build system directory tree.
a filesystem tree can be unambiguously serialized by walking in order from the root, iterating directory entries in byte-lexicographic order, then writing each entry's name, mode bits, and any file data (a symlink is represented with the target as "file data").
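
the walk described above can be sketched like this. assumptions that are mine and not from the thread: the 8-byte length prefixes and per-directory entry count (added so field boundaries are unambiguous), the choice of `blake2b`, and the names `serialize_tree` and `tree_checksum`. it also ignores ownership, timestamps, and xattrs, and assumes nothing writes to the tree while it runs.

```python
import hashlib
import os
import stat
import tempfile


def _field(data: bytes) -> bytes:
    # Length prefix so "ab" + "c" can never be confused with "a" + "bc".
    return len(data).to_bytes(8, "big") + data


def serialize_tree(root, update) -> None:
    """Feed name, mode bits, and file data of every entry to `update`."""
    with os.scandir(os.fsencode(root)) as it:
        # Names are bytes here, so sorting is byte-lexicographic.
        entries = sorted(it, key=lambda e: e.name)
    update(len(entries).to_bytes(8, "big"))
    for entry in entries:
        mode = entry.stat(follow_symlinks=False).st_mode
        update(_field(entry.name))
        update(mode.to_bytes(4, "big"))
        if stat.S_ISLNK(mode):
            update(_field(os.readlink(entry.path)))  # target as "file data"
        elif stat.S_ISDIR(mode):
            serialize_tree(entry.path, update)
        elif stat.S_ISREG(mode):
            with open(entry.path, "rb") as f:
                update(_field(f.read()))
        # Other entry types contribute only their name and mode bits.


def tree_checksum(root) -> str:
    h = hashlib.blake2b(digest_size=32)
    serialize_tree(root, h.update)
    return h.hexdigest()


def _populate(root: str, names: list[str]) -> None:
    os.mkdir(os.path.join(root, "sub"))
    for name in names:
        with open(os.path.join(root, "sub", name), "w") as f:
            f.write("contents of " + name)


# Same files created in a different order give the same checksum.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    _populate(a, ["x", "y", "z"])
    _populate(b, ["z", "y", "x"])
    checksum_a = tree_checksum(a)
    checksum_b = tree_checksum(b)
    with open(os.path.join(b, "sub", "x"), "a") as f:
        f.write("!")
    checksum_changed = tree_checksum(b)
```

the root directory's own name is deliberately left out, so two copies of the same tree in different locations serialize identically.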

@hipsterelectron@circumstances.run last month

in fact, this ends up being very similar to how tar or zip archives work, because these were intended as generic serialization formats for filesystem trees!

making these unambiguous invokes a whole separate discussion around what a checksum ends up "proving". but i do not have time to get into tree hashing of zip archives at this time or hash collision attacks on .tar.zst files--for now, we assume:

checksums don't collide,
the filesystem can be globally write-locked while recursively iterating,
we can "stop" and "restart" the kernel build process at arbitrary points