are filesystems on linux just safer than the network subsystems, or do they never expose any interface besides posix i/o and so have a much smaller, better-characterized attack surface?
The kernel never marks the corrupted page dirty for writeback, so the file on disk remains unchanged and ordinary on-disk checksum comparisons miss the modification. However, the page cache is what actually gets read when accessing the file, so the corrupted in-memory version is immediately visible system-wide.
i can't fucking believe the precise visibility vs persistence distinction is a key feature of the meme vuln. bad APIs are usually evil, i need to remember this
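you can actually watch the two views diverge from userspace: a normal read is served by the page cache while an O_DIRECT read goes to the device (the kernel only flushes *dirty* pages before a direct read, and the corrupted page is never dirty). rough sketch, made-up filename, no error handling:

```c
/* sketch: compare the page-cache view of a file with the on-disk view.
   "target.bin" is a made-up name; error handling elided. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *path = "target.bin";
    char cached[4096];

    /* view 1: ordinary read, satisfied from the page cache */
    int fd = open(path, O_RDONLY);
    ssize_t n = pread(fd, cached, sizeof cached, 0);
    close(fd);

    /* view 2: O_DIRECT read, skips the cache and hits the device;
       needs an aligned buffer and an aligned length */
    char *ondisk;
    posix_memalign((void **)&ondisk, 4096, 4096);
    fd = open(path, O_RDONLY | O_DIRECT);
    ssize_t m = pread(fd, ondisk, 4096, 0);
    close(fd);

    if (n > 0 && m >= n && memcmp(cached, ondisk, n) != 0)
        puts("page cache and disk disagree: visible but not persistent");
    else
        puts("views agree");
    free(ondisk);
    return 0;
}
```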
A core primitive underlying this bug is splice(): it transfers data between file descriptors and pipes without copying, passing page cache pages by reference. When a user splices a file into a pipe and then into an AF_ALG socket, the socket's input scatterlist holds direct references to the kernel's cached pages of that file. The pages are not duplicated; the scatterlist entries point at the same physical pages that back every read(), mmap(), and execve() of that file.
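for the shape of it, here's roughly what that fd plumbing looks like. hash op and file chosen arbitrarily, no error handling; the actual corruption needs a cipher op that writes its output in place, this only shows the zero-copy reference path:

```c
/* sketch of the splice plumbing: file -> pipe -> AF_ALG socket.
   "sha256" and /etc/hostname are arbitrary; error handling elided. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/if_alg.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "hash",
        .salg_name   = "sha256",
    };
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(tfm, (struct sockaddr *)&sa, sizeof sa);
    int op = accept(tfm, NULL, NULL);          /* per-request socket */

    int file = open("/etc/hostname", O_RDONLY); /* any cached file */
    int p[2];
    pipe(p);

    /* neither splice copies the bytes: the pipe buffer and then the
       socket's scatterlist hold references to the file's page cache pages */
    ssize_t n = splice(file, NULL, p[1], NULL, 4096, 0);
    splice(p[0], NULL, op, NULL, n, 0);

    unsigned char digest[32];
    read(op, digest, sizeof digest);           /* finalize the hash */
    for (int i = 0; i < 32; i++) printf("%02x", digest[i]);
    putchar('\n');
    return 0;
}
```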
so the splice() operation avoids copying through userspace. but you can improve perf over a userspace copy without literally handing out mutable refs: you can do the copy in kernel space
This in-place design is the root cause of the vulnerability. It places page cache pages in a writable scatterlist, separated from the legitimate write region by nothing more than an offset boundary.
they don't even have refs that enforce read/write permissions?
so the linux devs seem to exhibit this pattern of behavior wherein they:
- don't have any concept of i/o scheduling, or any way for applications to define a sequence of data dependencies, just heuristics
- any non-posix i/o API SIMPLY MUST be zero-copy in the kernel, because zero-copy is the only thing that is fast
- page cache entries are free real estate. is it the result of a read? is it the result of a write? who knows! it's visible everywhere immediately!
is fuchsia actually interesting or is it just bazelOS
i found this hpc paper with an argonne contributor while trying to find i/o prefetching research: https://dl.acm.org/doi/10.1145/1851476.1851499
apparently argonne has a parallel virtual file system that does "i/o scheduling" as i refer to it above: https://en.wikipedia.org/wiki/Parallel_Virtual_File_System
they do thankfully have their own non-posix i/o API. unfortunately it is very specifically for MD sims with MPI, which means (sketch after this list):
- one file is processed by every processor at once instead of each core (task) owning a distinct graph of read/write dependencies like a build tool
- read/write dependencies correspond to highly regular grid neighbors, largely a complete graph instead of corresponding to user-defined task dependencies
- persistence is not a correctness requirement
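for flavor, a sketch of that API shape (filename and slab size made up, error handling elided): every rank collectively reads its own slab of one shared file, and the MPI-IO layer underneath is free to schedule the aggregate pattern:

```c
/* sketch: the MPI-IO flavor of "one file, every rank at once".
   "grid.dat" and the slab size are assumptions; error handling elided. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "grid.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* collective read: every rank participates at once, so the library
       sees the whole access pattern and can schedule the aggregate i/o */
    const int count = 1 << 20;
    double *slab = malloc(count * sizeof *slab);
    MPI_Offset off = (MPI_Offset)rank * count * sizeof *slab;
    MPI_File_read_at_all(fh, off, slab, count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}
```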
they still use linux but pvfs through mpi lets them kinda manage i/o separately because in fact The Heuristics Are Not Tuned For Their Oddball Shit. i agree with that
"i/o visibility" as i've defined it above is pretty curious. it is a pretty sparse graph that identifies produce/consume relationships from write W in task A to read R in task B.
i guess that's a big difference: the i/o visibility is actually how tasks are linked into a dependency graph. so we have two separate degrees of freedom to optimize over (toy example after this list):
- we can schedule tasks, which determines the runtime sequence of data fetching
- we can schedule fetches and commits
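to pin that down, a toy sketch (every name here is hypothetical, not a real API): the visibility edge forces task order, and fetch/commit are scheduled independently of it:

```c
/* hypothetical sketch: visibility deps as a task graph, with task
   scheduling and fetch/commit scheduling kept separate */
#include <stdbool.h>
#include <stdio.h>

struct blob { int id; bool resident; bool persisted; };
struct task { int id; struct blob *reads[2]; int n_reads;
              struct blob *writes[2]; int n_writes; };

/* degree of freedom 2: the i/o plan decides when blobs move */
static void fetch(struct blob *b)  { b->resident = true;  printf("fetch %d\n", b->id); }
static void commit(struct blob *b) { b->persisted = true; printf("commit %d\n", b->id); }

static void run(struct task *t) {
    for (int i = 0; i < t->n_reads; i++)   /* reads must be resident */
        if (!t->reads[i]->resident) fetch(t->reads[i]);
    printf("task %d\n", t->id);
    for (int i = 0; i < t->n_writes; i++)
        t->writes[i]->resident = true;     /* produced, visible in memory */
}

int main(void) {
    struct blob w = { .id = 0 };           /* write W, produced by A */
    struct task a = { .id = 1, .writes = { &w }, .n_writes = 1 };
    struct task b = { .id = 2, .reads  = { &w }, .n_reads  = 1 };

    /* degree of freedom 1: task order. the visibility edge forces
       A before B, but nothing forces persistence in between */
    run(&a);
    run(&b);
    commit(&w);                            /* off the critical path */
    return 0;
}
```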
here's another domain assumption: visibility deps tend to follow structured concurrency (split, join, straight line), which makes data locality easier to solve for. i could use the double-ended zip format both in-flight and persisted, because it's easy to split and merge but can still be indexed out of band for visibility lookups. i like having a serializable format for the ephemeral filesystem environments of sandboxed process invocations, one that supports append/concat for straight-line write->read deps and can then be dumped to persistence
and now that we mention ephemeral filesystem trees and sandboxing it seems that this visibility graph is already a kind of structured permissions system. that we can codify a task's visible memory space with a uniquely-owned contiguous region of memory seems like a wonderful thing. it's ridiculous that every process can read every file by default
oh yeah, linux unshare supports creating process chroots from overlays, and i was gonna do this + FUSE for a build system, but the docs are exceptionally atrocious. a FUSE daemon that just allocates several gigs of RAM plus a direct-mapped file for persistence seems like a plausible way to roll your own page cache and filesystem in a pinch (sketched below), but:
(1) libfuse is unmaintained
(2) if i want to understand persistence i think i want to write the nvme driver
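still, the skeleton of that daemon is short. a hedged sketch with made-up names, a 1 MiB store standing in for the gigabytes, and error handling elided:

```c
/* sketch of the roll-your-own-page-cache idea: one file served out of an
   mmap of a backing image. build: gcc memfs.c $(pkg-config fuse3 --cflags --libs) */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

enum { STORE_SIZE = 1 << 20 };   /* 1 MiB stand-in for "several gigs" */
static char *store;              /* the mmap'd persistence file */

static int fs_getattr(const char *path, struct stat *st,
                      struct fuse_file_info *fi) {
    (void)fi;
    memset(st, 0, sizeof *st);
    if (!strcmp(path, "/"))     { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0; }
    if (!strcmp(path, "/data")) { st->st_mode = S_IFREG | 0644; st->st_nlink = 1;
                                  st->st_size = STORE_SIZE; return 0; }
    return -ENOENT;
}

static int fs_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi) {
    (void)fi;
    if (strcmp(path, "/data")) return -ENOENT;
    if (off >= STORE_SIZE) return 0;
    if (off + (off_t)size > STORE_SIZE) size = STORE_SIZE - off;
    memcpy(buf, store + off, size);    /* reads come straight from RAM */
    return (int)size;
}

static int fs_write(const char *path, const char *buf, size_t size, off_t off,
                    struct fuse_file_info *fi) {
    (void)fi;
    if (strcmp(path, "/data")) return -ENOENT;
    if (off + (off_t)size > STORE_SIZE) return -EFBIG;
    memcpy(store + off, buf, size);    /* dirty in RAM; msync() is the commit */
    return (int)size;
}

static const struct fuse_operations ops = {
    .getattr = fs_getattr, .read = fs_read, .write = fs_write,
};

int main(int argc, char *argv[]) {
    int fd = open("backing.img", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, STORE_SIZE);
    store = mmap(NULL, STORE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return fuse_main(argc, argv, &ops, NULL);
}
```

(no readdir, so ls won't list /data, but reading and writing the path works; msync or a flush handler would be the persistence hook)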
i believe to execute mount you have to be root (CAP_SYS_ADMIN), and i believe this includes overlayfs, so if you want your ephemeral overlayfs (a very basic thing to want; see the syscall sketch after this list) you need to fucking:
- unshare with the incantation that maps your current uid to 0 (root), producing an interactive subshell
- from this persistent subshell, mount into a named directory to construct the ephemeral overlay
- fucking unshare again to map the overlay dir as the root fs, with extra incantations for each isolation mode (e.g. network)
- execute your process
- output lands in the overlay chroot
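collapsed into a single process, the whole dance is just a handful of syscalls. a sketch, assuming /tmp/{lower,upper,work,merged} already exist, a 5.11+ kernel that allows unprivileged overlayfs mounts, and a shell inside the merged tree; error handling elided:

```c
/* sketch of the incantation as raw syscalls; paths are assumptions */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

static void write_file(const char *path, const char *s) {
    int fd = open(path, O_WRONLY);
    write(fd, s, strlen(s));
    close(fd);
}

int main(void) {
    char map[64];
    uid_t uid = getuid();
    gid_t gid = getgid();

    /* step 1: new user + mount namespaces, no actual root required */
    unshare(CLONE_NEWUSER | CLONE_NEWNS);

    /* map our real uid/gid to 0 ("root") inside the namespace */
    snprintf(map, sizeof map, "0 %u 1", (unsigned)uid);
    write_file("/proc/self/uid_map", map);
    write_file("/proc/self/setgroups", "deny");  /* required before gid_map */
    snprintf(map, sizeof map, "0 %u 1", (unsigned)gid);
    write_file("/proc/self/gid_map", map);

    /* keep our mounts from propagating to the parent namespace */
    mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);

    /* step 2: the ephemeral overlay */
    mount("overlay", "/tmp/merged", "overlay", 0,
          "lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work");

    /* step 3: make it the root fs and run the task; writes land in
       /tmp/upper, which can be thrown away afterwards */
    chroot("/tmp/merged");
    chdir("/");
    execlp("sh", "sh", "-c", "echo hi > /out.txt", (char *)NULL);
    perror("execlp");   /* only reached if the merged tree has no shell */
    return 1;
}
```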
and of course you most definitely cannot isolate everything and allowlist, linus torvalds assumes you just want to isolate one thing at a time. there's a flag for maintaining a persistent directory or file for the namespace, which as far as i can tell bind-mounts /proc/self/ns/* onto that path so the namespace outlives the process
and that still doesn't at all let you optimize across the three musketeers of i/o from above:
- application-level write chunking
- i/o visibility scheduling/prefetching
- persistence
omg wait does this mean i get to make my own libc. god, i'm gonna have to if i want to run any code, right
@hipsterelectron thinking of @dysfun wanting to write a fedi client and immediately going into the weeds of database theory