@SRAZKVT you can use my fractal zip format for this purpose actually (one of the design goals was representing a dataset with changes over time). that's actually a really fascinating corollary of using it to represent atomic transitions between reproducible filesystem states....................i did not realize that VCS is precisely a formulation of atomic filesystem transactions. need to think about this further
@SRAZKVT i was probably so so close to this exact epiphany when i tried to prototype sharing the twitter git FUSE object db with the pants object store. i couldn't understand what the hell a commit was and ended up making this i/o virtualization framework instead https://github.com/cosmicexplorer/upc/blob/master/local/memory/LibMemory.scala
@SRAZKVT thank you aaaaaaaaaaa
@SRAZKVT one thing i've noticed is that OS research has this idea of a "checkpoint" like a save state but doesn't have any concept of like a delta between states. checkpoints are just reset points. filesystems have snapshots over time like zfs but they're not like a way to replay the result of a process execution, they just use generic tree diffing algorithms and consider it an optimization. there's no idea of reproducible semantics
@SRAZKVT this also bleeds into "reproducible builds" which negs maintainers to make highly questionable changes to their build processes so the checksum matches but just whimpers and cries and calls it impossible to support secret data like kernel module signatures
@SRAZKVT i like basically everyone who works on guix and i think they'd be open to this kind of argument
@SRAZKVT interestingly, this represents a novel conception of "swappability" which is my little neologism for "reproducibility that transfers across dependency graphs". it's something that i argue spack provides with its logic graph that lets you swap out package dependencies and expect things to still work--i didn't realize it could also apply to filesystem state transitions like this. so that's another argument for the excessive narrowness of "reproducibility"--guix and nix use checkpoint/checksum reproducibility to conflate both package build reproducibility and filesystem i/o state reproducibility
@SRAZKVT one really good point you made a while back that stuck with me was observing how odd it was that git calculates diffs on the fly all the time (and subsequently recognizing the confusion between printing a "commit" as a delta, but using a "commit" checksum to represent a discrete filesystem state). if the commit is signed, is that signature for the delta or for the resulting filesystem state? and iirc we agreed that a contributor would sign a delta, but a maintainer would sign a state (e.g. after merging a PR to main)
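a rough sketch of the delta-vs-state distinction, in case it helps to see it concretely. everything here is hypothetical illustration: `state_digest` and `delta_digest` are made-up names, the dict-of-paths "filesystem" is a toy, and this is not how git actually serializes commits or trees. the point is only that the thing a contributor would sign (the change) and the thing a maintainer would sign (the result) are different byte strings with different digests.

```python
import hashlib
from typing import Dict

Tree = Dict[str, bytes]  # toy filesystem state: path -> file contents (an assumption)


def _put(h, part) -> None:
    """Feed one optional field into the hash with a presence byte and length prefix."""
    if part is None:
        h.update(b"\x00")
    else:
        h.update(b"\x01" + len(part).to_bytes(8, "big") + part)


def state_digest(tree: Tree) -> str:
    """Digest of a whole state: every path and its contents, in sorted path order."""
    h = hashlib.sha256()
    for path in sorted(tree):
        _put(h, path.encode())
        _put(h, tree[path])
    return h.hexdigest()


def delta_digest(parent: Tree, child: Tree) -> str:
    """Digest of the change parent -> child: only added, modified, or deleted paths."""
    h = hashlib.sha256()
    for path in sorted(set(parent) | set(child)):
        old, new = parent.get(path), child.get(path)
        if old == new:
            continue  # untouched files are not part of the delta
        _put(h, path.encode())
        _put(h, old)  # None means the path did not exist before
        _put(h, new)  # None means the path was deleted
    return h.hexdigest()


main_c = b"int main(void) { return 0; }\n"

# two different histories that land on the same state
base_a = {"README": b"hello\n"}
base_b = {"README": b"hello\n", "TODO": b"fix it\n"}
result = {"README": b"hello\n", "main.c": main_c}

# one patch ("add main.c") applied on top of two different bases
other_base = {"README": b"hi\n"}
other_result = {"README": b"hi\n", "main.c": main_c}

print(delta_digest(base_a, result) == delta_digest(base_b, result))
print(delta_digest(base_a, result) == delta_digest(other_base, other_result))
print(state_digest(result) == state_digest(other_result))
```

the first comparison is two different deltas reaching one state, the second is one delta producing two different states. so a signature over the delta says nothing about which state you end up in, and a signature over the state says nothing about how you got there.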
@SRAZKVT and interestingly enough, tree hashing is a way to share an intermediate checksum computation for an individual element of a sequence in a way that enables reordering the sequence and efficiently regenerating a checksum for the whole sequence (plain SHA-1/SHA-2 hash strictly in order and can't do this on their own; BLAKE3 is a tree hash by construction and BLAKE2 has a tree mode). so that actually distinguishes the checksum computed for a sequence of commit deltas from the checksum for the resulting filesystem tree. and that's interesting!
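here's a minimal sketch of that sharing, under some loud assumptions: it's a generic merkle tree built from stdlib `hashlib.blake2b` with `person` used for leaf/node domain separation, not BLAKE3's actual internal tree layout, and `leaf_hash`, `node_hash`, and `merkle_root` are hypothetical names. each delta is hashed once; reordering the sequence only rehashes the small interior nodes.

```python
import hashlib
from typing import List


def leaf_hash(delta: bytes) -> bytes:
    """Hash one element of the sequence. This is the expensive, cacheable part."""
    return hashlib.blake2b(delta, digest_size=32, person=b"leaf").digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Combine two child digests; cost is independent of the size of the deltas."""
    return hashlib.blake2b(left + right, digest_size=32, person=b"node").digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Fold cached leaf digests pairwise into a single root digest."""
    if not leaves:
        return leaf_hash(b"")
    level = list(leaves)
    while len(level) > 1:
        nxt = [node_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd digest out is promoted unchanged
        level = nxt
    return level[0]


deltas = [b"add main.c", b"fix typo in README", b"delete TODO"]
cached = [leaf_hash(d) for d in deltas]  # every delta's bytes are read exactly once

root_original = merkle_root(cached)

# reorder the sequence: only 32-byte digests get rehashed, never the deltas
reordered = [cached[2], cached[0], cached[1]]
root_reordered = merkle_root(reordered)

print(root_original.hex())
print(root_reordered.hex())
print(root_original != root_reordered)  # order is still part of what is checksummed
```

with a plain sequential hash over the concatenated deltas you'd have to feed every byte through again after a reorder. and note the root here is a checksum of the sequence of deltas, which is a different object from a checksum of the filesystem tree you get by applying them.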
i think you could actually directly solve the literally fake and made up conundrum of "reproducible" kernel module signing this way. to wit: https://lwn.net/Articles/1012946/
"Reproducible builds are important for security, since they allow independent verification that an open-source project has been compiled without inserting extra, malicious changes."
ok, but "independent verification" depends on some pretty subtle cryptography that most people absolutely do not understand. i am going to teach it to you now.
(cryptography cannot verify whether a change is "malicious", btw. but it can give you a pretty strong hint, as we will see.)
"For something as foundational as the Linux kernel, it would be nice to be able to verify that a build of the kernel is unmodified."
unmodified compared to...? we will find this is in fact the crucial distinction. and cryptography can answer this!
"But when signing keys are needed for the build, it cannot be made reproducible without distributing the key."
so, not technically "wrong", but it fails to interrogate what a "signature" means. here's where we get into symmetric vs asymmetric crypto
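to make the symmetric/asymmetric distinction concrete, a toy sketch. big caveats: the symmetric half is real stdlib HMAC, but the asymmetric half is textbook RSA with tiny hardcoded primes, picked only so it runs without third-party libraries. it is completely insecure (the modulus is 3233, so digest collisions are trivial) and is not how kernel module signing is implemented. all function names are hypothetical.

```python
import hashlib
import hmac


# --- symmetric: the same secret both creates and checks the tag ---
def mac_tag(secret: bytes, artifact: bytes) -> bytes:
    return hmac.new(secret, artifact, hashlib.sha256).digest()


def mac_check(secret: bytes, artifact: bytes, tag: bytes) -> bool:
    # anyone able to run this check holds `secret`, so they could also forge tags
    return hmac.compare_digest(mac_tag(secret, artifact), tag)


# --- asymmetric: textbook RSA with toy parameters (NOT secure, illustration only) ---
N = 61 * 53  # public modulus, 3233
E = 17       # public exponent: all a verifier needs, together with N
D = 2753     # private exponent: 17 * 2753 = 46801 = 15 * 3120 + 1, where 3120 = 60 * 52


def toy_digest(artifact: bytes) -> int:
    # reduce a real digest into the toy modulus; this throws away almost all of it
    return int.from_bytes(hashlib.sha256(artifact).digest(), "big") % N


def toy_sign(private_d: int, artifact: bytes) -> int:
    return pow(toy_digest(artifact), private_d, N)


def toy_verify(public_e: int, artifact: bytes, signature: int) -> bool:
    # uses only public values; the private exponent never leaves the signer
    return pow(signature, public_e, N) == toy_digest(artifact)


module = b"pretend these are kernel module bytes"

tag = mac_tag(b"shared secret", module)
print(mac_check(b"shared secret", module, tag))  # checking requires the secret itself

sig = toy_sign(D, module)
print(toy_verify(E, module, sig))  # checking requires only E and N
```

so with a symmetric scheme, "verify" and "be able to forge" are the same capability. with an asymmetric one, checking a signature over some bytes needs nothing secret, which is the part that matters when asking what exactly has to be distributed.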