of course the toml crate has serde on it
cargofuckyourself
[morpheus.jpg] what if i told you
you could build an extensible task system
outside the build system
i'm thinking about this because:
(1) this is also the exact thing that pip needs for parallel cacheable package finding
(2) the one thing that always annoyed me about pants (and that i spent quite some time attempting to fix) was how it stuffed like 300 separate individual great ideas into a single monolithic cli
pants in particular demonstrates a very powerful request-response model with a cyclic task dependency graph. as pants grew bigger and added more functionality, its reliance on a persistent daemon became self-reinforcing
pants could previously be invoked through a ./pants shell script at the root of the repo (i still have the muscle memory). it now employs scie-jump, a system for bootstrapping executables that can be thought of as a generalization of pex zipapps
the point is that even the monolithic pants is now assembled from a specification. and i think furthermore that such an assembly process can be a component of hooking up toolchains that don't want to talk to each other
consider autoconf. the thing i've wanted to hack into it for ages is parallel test evaluation. the part i hadn't considered was how the result of a test can often be cached globally. when is this not possible? that can only be answered if tests are able to specify their distinct inputs, i.e. their dependencies. autoconf sadly does not have a task system
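here's a tiny sketch of what i mean, none of it is autoconf and the helpers are mine: tests that declare their inputs get a global cache key from a digest of those inputs, and independent tests can run in parallel

```python
# a sketch, not autoconf: each test declares its inputs, the cache key is a digest of them,
# and tests with no dependency on each other run in parallel.
import hashlib, json
from concurrent.futures import ThreadPoolExecutor

CACHE = {}  # digest -> result; a real version would persist this globally on disk

def cache_key(test_name, inputs):
    blob = json.dumps({"test": test_name, "inputs": inputs}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def run_test(test_name, inputs, check):
    key = cache_key(test_name, inputs)
    if key not in CACHE:
        CACHE[key] = check(inputs)
    return CACHE[key]

def check_have_zstd_header(inputs):
    # stand-in for "compile a one-line program with these flags and see if it works"
    return "zstd" in inputs.get("cflags", "")

tests = [
    ("have_zstd_h", {"cc": "cc", "cflags": "-I/opt/zstd/include"}, check_have_zstd_header),
    ("have_zstd_h", {"cc": "cc", "cflags": ""}, check_have_zstd_header),
]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda t: run_test(*t), tests))
```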
cargo's problems are legion:
(1) build scripts can't do IPC to other build scripts
(2) build scripts can't fetch external resources the way cargo fetches rust dependencies for them
(3) build scripts don't have any shared output space where one package can compute an output that's shared with others, even in the same dep graph
the way packages can and do work around this is to simply write to a known filesystem path as a primitive form of IPC. this is highly problematic: there's no coordination between writers, no invalidation, and no record of what produced the file
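roughly what that workaround amounts to, sketched in python rather than a build.rs (the path and helper are made up):

```python
# every package races to a well-known path with at best an advisory lock; nothing
# invalidates the file and nothing records who produced it.
import fcntl, os

SHARED = os.path.expanduser("~/.cache/shared-artifact")  # hypothetical well-known path

def get_or_build_shared(build):
    os.makedirs(os.path.dirname(SHARED), exist_ok=True)
    with open(SHARED + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # unix-only advisory lock
        if not os.path.exists(SHARED):
            tmp = SHARED + ".tmp"
            with open(tmp, "wb") as f:
                f.write(build())
            os.rename(tmp, SHARED)
    return SHARED
```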
pip's case is interesting too:
(1) it has a specific fetch-parse-cache process it needs to perform before it can do any resolve work (sketched after this list)
(2) it needs to be portable, so it can't add its own native code
(3) it has builds and fetches that can occur in parallel
(4) it wants to avoid being a build system like cargo, but it ends up propagating build provenance through its outputs
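on (1): a sketch of the fetch-parse-cache shape, very much not pip's actual code -- index pages fetched in parallel, parsed once, and cached on disk keyed by a digest of the url so later resolves can skip the network

```python
# not pip's code: fetch simple-index pages in parallel, parse the links once, cache the
# parsed result on disk keyed by a digest of the url.
import hashlib, json, pathlib, urllib.request
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

CACHE_DIR = pathlib.Path("~/.cache/index-pages").expanduser()

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def find_links(index_url, project):
    url = f"{index_url}/{project}/"
    cached = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    if cached.exists():
        return json.loads(cached.read_text())
    parser = LinkParser()
    with urllib.request.urlopen(url) as resp:
        parser.feed(resp.read().decode())
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached.write_text(json.dumps(parser.links))
    return parser.links

with ThreadPoolExecutor() as pool:
    projects = ["requests", "numpy"]
    results = dict(zip(projects, pool.map(
        lambda p: find_links("https://pypi.org/simple", p), projects)))
```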
the pants project used to have a set of binary bootstrap scripts for things like gcc (that's the one i maintained for twitter's c++ code)
the reasons this arose were:
(1) i need to build zstd if not available
(2) i need to build curl if not available
(3) i need to build make if not available
there are distinct contexts separating all of these! a bootstrap make doesn't need to have guile extensions. a bootstrap curl doesn't need to have zstd built in. a bootstrap zstd doesn't need multithreading
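that observation sketched as data (the configure flags are illustrative assumptions, not a tested recipe): each bootstrap target carries its own minimal feature set, distinct from the others

```python
# illustrative only: the point is that each bootstrap target has its own minimal context.
from dataclasses import dataclass

@dataclass(frozen=True)
class BootstrapSpec:
    name: str
    version: str
    configure_args: tuple = ()

BOOTSTRAP_CONTEXTS = [
    BootstrapSpec("make", "4.4.1", ("--without-guile",)),
    BootstrapSpec("curl", "8.5.0", ("--without-zstd", "--without-brotli")),
    BootstrapSpec("zstd", "1.5.5"),  # a single-threaded build would be plenty here
]
```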
a task system for recursive bootstrapping (generating shared resources like interpreters and libraries) that also fetches and caches...........
pants had this separation between file/directory content (stored persistently in a merkle tree in a k/v db by checksum key) and in-memory build products. one problem with it was that it stored absolutely every single intermediate file and directory used over the course of a bootstrapping process. really wasteful and necessitated the obnoxious LMDB store. this was the passion project of the bazel engineer who convinced twitter to switch to bazel
but a persistent file store that stores important nodes is a really sick idea. just earlier i built a terrible "resource" fetching abstraction for cargo build scripts and realized, wait, this is kind of like the pants file store? and then recalled that spack has one of these too, but it stores specific intermediate directory outputs
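the rough shape of such a store, sketched minimally (this is not pants' implementation, just the idea): file bytes keyed by digest, directories as small manifests of name -> digest, and only the nodes you care about get stored

```python
# not pants' implementation, just the idea of a content-addressed store.
import hashlib, json, pathlib

STORE = pathlib.Path("~/.cache/cas").expanduser()

def put_blob(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    path = STORE / digest[:2] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest

def put_tree(directory: pathlib.Path) -> str:
    manifest = {}
    for child in sorted(directory.iterdir()):
        if child.is_dir():
            manifest[child.name + "/"] = put_tree(child)
        else:
            manifest[child.name] = put_blob(child.read_bytes())
    return put_blob(json.dumps(manifest, sort_keys=True).encode())
```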
i think that for complex tasks, python is the right way to achieve extensibility. no glorified yaml like starlark with its own mysterious semantics. it's gonna be a real programming language
but the thing autoconf does achieve is not requiring you to have a python interpreter beforehand, which is why it can be used to build the python interpreter
one other curious thought from the pip case: there may be information that should be globally available, but is highly inefficient in distinct file outputs--it needs to be synchronized into some shared state (like a sqlite db). this is how the available versions for a python package can be made queryable
that's notable because it represents a different kind of state and a different type of dependency than other tasks
in the case of querying python indices, it represents the world state at that time
(importantly, the "world state" paradigm can also be applied to other forms of global mutable state like the filesystem. this is one problem with spack externals -- they don't have any concept of invalidation)
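a sketch of what that shared state could look like (the table layout is mine, purely illustrative): available versions synced into sqlite, with a timestamp recording when the world was observed so rows can be invalidated or refreshed later

```python
# illustrative table layout: versions plus a record of when the world was observed.
import sqlite3, time

db = sqlite3.connect("world.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS available_versions (
        project TEXT NOT NULL,
        version TEXT NOT NULL,
        index_url TEXT NOT NULL,
        observed_at REAL NOT NULL,
        PRIMARY KEY (project, version, index_url)
    )
""")

def record_versions(project, index_url, versions):
    now = time.time()
    db.executemany(
        "INSERT OR REPLACE INTO available_versions VALUES (?, ?, ?, ?)",
        [(project, v, index_url, now) for v in versions])
    db.commit()

def fresh_versions(project, max_age_seconds):
    cutoff = time.time() - max_age_seconds
    rows = db.execute(
        "SELECT version FROM available_versions WHERE project = ? AND observed_at >= ?",
        (project, cutoff))
    return [v for (v,) in rows]
```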
so where does it end?
so if we want a zstd to use for bootstrapping, we have two options:
(1) use pkg-config to find one matching the desired version spec
(2) bootstrap a known version from source
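both paths sketched in one place (the bootstrap task is stubbed out; version numbers are just examples):

```python
# option 1: ask pkg-config for a matching libzstd. option 2: bootstrap a known version.
import subprocess

def find_system_zstd(spec="libzstd >= 1.5.0"):
    # pkg-config exits 0 only when an installed module satisfies the constraint
    if subprocess.run(["pkg-config", "--exists", spec]).returncode != 0:
        return None
    return subprocess.run(["pkg-config", "--modversion", "libzstd"],
                          capture_output=True, text=True).stdout.strip()

def bootstrap_zstd_from_source(version):
    # in the imagined task system this would be a cached task: fetch, unpack, make
    raise NotImplementedError

def obtain_zstd():
    found = find_system_zstd()
    if found is not None:
        return ("system", found)
    return ("bootstrapped", bootstrap_zstd_from_source("1.5.5"))
```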
spack is a massive piece of machinery to resolve a dependency graph with versions of c and c++ and python etc. it's not worth anyone's time to recreate from scratch. but spack itself has bootstrap needs (it needs python)
it would be impossible to reproduce spack-style resolves without spack itself. but we could wrap spack, after bootstrapping its dependencies
omg we could bootstrap a rust toolchain. that'd be so fun and flirty
if we can bootstrap python, we can bootstrap pip
guix has already gone down the reproducibility path with gnu mes. we won. final boss defeated
pants dug itself into a bit of a corner. in fact it's about to lose one of the most powerful aspects of the rule graph because scanning for internal (recursive) calls poses a performance problem, so the recursive calls need to perform the type-based resolution by hand https://github.com/pantsbuild/pants/issues/19730
i had proposed defining rules in webassembly or even jvm bytecode before, because one of the biggest points of feedback from twitter engineers was "we don't understand this complex python system, we write scala code". this was one of the most reasonable and honest bits of feedback i've ever received
how do you architect a system that crosses ecosystems?
build tools have largely converged upon process executions as the shared unit of work and the filesystem as IPC. this is the most successful and portable FFI i am aware of
process executions are also memory-safe by default, thanks to the OS's address space isolation. there are portable synchronization tools for the filesystem alone, and higher-performance ones depending upon the platform. there's a standard resource discovery mechanism, and a standard name assignment mechanism
there are some very small downsides:
(a) the filesystem is a hardware resource, and therefore has some inherent limits
(b) the OS does its own bookkeeping at the vfs layer
(c) process execution alone imposes a mild overhead
so we have a very very general high-level database in the filesystem, and a very very general FFI mechanism in process execution. and these can be used to communicate between tools that don't otherwise have a shared protocol--like cargo and spack, or cargo and pip, or cargo and literally anything
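the smallest sketch of that FFI i can come up with (the helper names are mine): a task is argv plus input files, its cache key is a digest of both, and its outputs are whatever files the process leaves behind in a scratch directory. anything that can be exec'd speaks this protocol

```python
# a task = argv + input files; cache key = digest of both; outputs = files left behind.
import hashlib, pathlib, shutil, subprocess, tempfile

CACHE = pathlib.Path("~/.cache/process-outputs").expanduser()

def run_cached(argv, input_files):
    h = hashlib.sha256(repr(argv).encode())
    for f in sorted(input_files):
        h.update(pathlib.Path(f).read_bytes())
    out_dir = CACHE / h.hexdigest()
    if out_dir.exists():
        return out_dir  # cache hit: reuse the previous outputs
    scratch = pathlib.Path(tempfile.mkdtemp())
    for f in input_files:
        shutil.copy(f, scratch)
    subprocess.run(argv, cwd=scratch, check=True)
    CACHE.mkdir(parents=True, exist_ok=True)
    shutil.move(str(scratch), str(out_dir))
    return out_dir
```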
we can also add network requests to the above, and we'll have outlined a system that can bootstrap curl
what if we could define tasks that could be fulfilled by a python function, a jvm function, or a process execution? what then?
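one possible shape (all names here are hypothetical; the jvm case is shown as a plain process execution of `java` since that's the portable lowest common denominator, and a persistent jvm could slot in behind the same interface):

```python
# one request shape, several interchangeable fulfillment backends.
import subprocess
from typing import Callable, Protocol

class Fulfillment(Protocol):
    def __call__(self, request: dict) -> dict: ...

def python_backend(fn: Callable[[dict], dict]) -> Fulfillment:
    return fn  # runs in-process

def process_backend(argv: list) -> Fulfillment:
    def run(request: dict) -> dict:
        proc = subprocess.run(argv + [request["arg"]],
                              capture_output=True, text=True, check=True)
        return {"stdout": proc.stdout}
    return run

def jvm_backend(jar: str, main_class: str) -> Fulfillment:
    return process_backend(["java", "-cp", jar, main_class])

FULFILLMENTS = {
    "checksum": python_backend(lambda req: {"digest": "..."}),          # python function
    "compile": jvm_backend("compiler.jar", "com.example.CompileMain"),  # jvm function
    "fetch": process_backend(["curl", "-sSLO"]),                        # process execution
}
```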
another of the many projects i've probably spent at least a month of 8-hour days on: https://github.com/cosmicexplorer/upc. this one tried to virtualize i/o
i think i had it inside out (https://github.com/cosmicexplorer/upc/blob/master/local/virtual-cli/client/VFS.scala): instead of making everything conform to the process/filesystem interface, make process/filesystem conform to the everything interface
@hipsterelectron cargo
at least it's not cmake
@hipsterelectron there’s always merde (nope, he started facet, that’s right)
https://github.com/bearcove/merde